Commit Graph

259 Commits

Author SHA1 Message Date
Tony Hutter
9a2e90c9fc Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401
Closes #7416
2018-04-09 17:29:59 -04:00
Paul Zuchowski
0a0af41bd9 zdb and inuse tests don't pass with real disks
Due to zpool create auto-partioning in Linux (i.e. sdb1),
certain utilities need to use the parition (sdb1) while
others use the whole disk name (sdb).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #6939
Closes #7261
2018-03-14 16:10:38 -07:00
Wolfgang Bumiller
3808006edf Take user namespaces into account in policy checks
Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

This also adds an initial user namespace regression test
for the setgid bit loss, with a user_ns_exec helper usable
in further tests.

Additionally, configure checks for the required user
namespace related features are added for:
  * ns_capable
  * kuid/kgid_has_mapping()
  * user_ns in cred_t

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes #6800
Closes #7270
2018-03-14 16:10:38 -07:00
John Eismeier
33bb1e8256 Fix some typos
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Eismeier <john.eismeier@gmail.com>
Closes #7237
2018-03-14 16:10:38 -07:00
Tony Hutter
99920d823e Add scrub after resilver zed script
* Add a zed script to kick off a scrub after a resilver.  The script is
disabled by default.

* Add a optional $PATH (-P) option to zed to allow it to use a custom
$PATH for its zedlets.  This is needed when you're running zed under
the ZTS in a local workspace.

* Update test scripts to not copy in all-debug.sh and all-syslog.sh by
default.  They can be optionally copied in as part of zed_setup().
These scripts slow down zed considerably under heavy events loads and
can cause events to be dropped or their delivery delayed. This was
causing some sporadic failures in the 'fault' tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4662
Closes #7086
2018-03-14 16:10:37 -07:00
Giuseppe Di Natale
d5b10b3ef3 Correct count_uberblocks in mmp.kshlib
A log_must call was causing count_uberblocks to return more
than just the uberblock count. Remove the log_must since it
was only logging a sleep.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7191
2018-03-14 16:10:37 -07:00
sanjeevbagewadi
b3da003ebf Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #7011
Closes #7054
2018-03-14 16:10:37 -07:00
Chunwei Chen
478754a8f5 Fix zdb -ed on objset for exported pool
zdb -ed on objset for exported pool would failed with:
  failed to own dataset 'qq/fs0': No such file or directory

The reason is that zdb pass objset name to spa_import, it uses that
name to create a spa. Later, when dmu_objset_own tries to lookup the spa
using real pool name, it can't find one.

We fix this by make sure we pass pool name rather than objset name to
spa_import.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
Closes #6464
2018-03-14 16:10:37 -07:00
LOLi
6c891ade8b ZTS: Fix create-o_ashift test case
The function that fills the uberblock ring buffer on every device label
has been reworked to avoid occasional failures caused by a race
condition that prevents 'zpool sync' from writing some uberblock
sequentially: this happens when the pool sync ioctl dispatch code calls
txg_wait_synced() while we're already waiting for a TXG to sync.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6924
Closes #6977
2018-03-14 16:10:36 -07:00
Brian Behlendorf
5d62588032 Remove vn_rename and vn_remove dependency
The only place vn_rename and vn_remove are used is when writing
out an updated pool configuration file.  By truncating the file
instead of renaming and removing it we can avoid having to implement
these interfaces entirely.  Functionally an empty cache file is
treated the same as a missing cache file.  This is particularly
advantageous because the Linux kernel has never provided a way
to reliably implement vn_rename and vn_remove.

The cachefile_004_pos.ksh test case was updated to understand
that an empty cache file is the same as a missing one.

The zfs-import-* systemd service files were not updated to use
ConditionFileNotEmpty in place of ConditionPathExists.  This
means that after exporting all pools and rebooting new pools
will not the scanned for on the next boot.  This small change
should not impact normal usage since pools are not exported
as part of a normal shutdown.

Documentation was updated accordingly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#648
Closes #6753
2018-03-14 16:10:36 -07:00
Brian Behlendorf
184087f822 Add configure option to enable gcov analysis
* Add configure option to enable gcov analysis.
* Includes a few minor ctime fixes.
* Add codecov.yml configuration.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6642
2018-03-14 16:10:36 -07:00
LOLi
2f62fdd644 Fix 'zfs receive -o' when used with '-e|-d'
When used in conjunction with one of '-e' or '-d' zfs receive options
none of the properties requested to be set (-o) are actually applied:
this is caused by a wrong assumption made about the toplevel dataset
in zfs_receive_one().

Fix this by correctly detecting the toplevel dataset.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7088

Requires-spl: refs/pull/679/head
2018-01-30 10:27:32 -06:00
Brian Behlendorf
9d1a39cec6 Fix shellcheck v0.4.6 warnings
Resolve new warnings reported after upgrading to shellcheck
version 0.4.6.  This patch contains no functional changes.

* egrep is non-standard and deprecated. Use grep -E instead. [SC2196]
* Check exit code directly with e.g. 'if mycmd;', not indirectly
  with $?.  [SC2181]  Suppressed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7040
2018-01-30 10:27:31 -06:00
LOLi
a8fa31b50b Fix 'zpool add' handling of nested interior VDEVs
When replacing a faulted device which was previously handled by a spare
multiple levels of nested interior VDEVs will be present in the pool
configuration; the following example illustrates one of the possible
situations:

   NAME                          STATE     READ WRITE CKSUM
   testpool                      DEGRADED     0     0     0
     raidz1-0                    DEGRADED     0     0     0
       spare-0                   DEGRADED     0     0     0
         replacing-0             DEGRADED     0     0     0
           /var/tmp/fault-dev    UNAVAIL      0     0     0  cannot open
           /var/tmp/replace-dev  ONLINE       0     0     0
         /var/tmp/spare-dev1     ONLINE       0     0     0
       /var/tmp/safe-dev         ONLINE       0     0     0
   spares
     /var/tmp/spare-dev1         INUSE     currently in use

This is safe and allowed, but get_replication() needs to handle this
situation gracefully to let zpool add new devices to the pool.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6678
Closes #6996
2018-01-30 10:27:31 -06:00
Giuseppe Di Natale
c2aacf2087 Handle broken pipes in arc_summary
Using a command similar to 'arc_summary.py | head' causes
a broken pipe exception. Gracefully exit in the case of a
broken pipe in arc_summary.py.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6965 
Closes #6969
2018-01-30 10:27:31 -06:00
LOLi
9a6c57845a Handle invalid options in arc_summary
If an invalid option is provided to arc_summary.py we handle any error
thrown from the getopt Python module and print the usage help message.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6983
2018-01-30 10:27:31 -06:00
David Quigley
53e5890cff Fix bug in distclean which removes needed files
Running distclean removes the following files because of an error
in Makefile.am

deleted:    tests/zfs-tests/include/commands.cfg
deleted:    tests/zfs-tests/include/libtest.shlib
deleted:    tests/zfs-tests/include/math.shlib
deleted:    tests/zfs-tests/include/properties.shlib
deleted:    tests/zfs-tests/include/zpool_script.shlib

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6636
2018-01-30 10:27:30 -06:00
Brian Behlendorf
504bfc8b49 Fix multihost stale cache file import
When the multihost property is enabled it should be impossible to
import an active pool even using the force (-f) option.  This patch
prevents a forced import from succeeding when importing with a
stale cache file.

The root cause of the problem is that the kernel modules trusted
the hostid provided in configuration.  This is always correct when
the configuration is generated by scanning for the pool.  However,
when using an existing cache file the hostid could be stale which
would result in the activity check being skipped.

Resolve the issue by always using the hostid read from the label
configuration where the best uberblock was found.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6933
Closes #6971
2017-12-18 10:31:01 -08:00
Olaf Faaland
53a8cbd70e Fix ZTS MMP tests and ztest -M behavior
Quote "$MMP_IMPORT_MSG" when it is passed as an argument, as it is a
multi-word string.  Some tests were passing when they should not have,
because the grep was only testing for the first word.

Correct the message expected when no hostid is set and the test attempts
to enable multihost.  It did not match the actual output in that
situation.

Disable ztest_reguid() when ztest is invoked with the -M option.  If
ztest performs a reguid, a concurrent import attempt may fail with the
error "one or more devices is currently unavailable" if the guid sum is
calculated on the original device guids but compared against the guid
sum ztest wrote based on the new device guids.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6666
2017-12-18 10:14:39 -08:00
Giuseppe Di Natale
cf21b5b5b2 Allow test-runner to filter test groups by tag
Enable test-runner to accept a list of tags to identify
which test groups the user wishes to run.

Also allow test-runner to perform multiple iterations
of a test run.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6788
2017-12-06 13:25:40 -06:00
Brian Behlendorf
ddd20dbe0b Fix 'zpool create|add' replication level check
When the pool configuration contains a hole due to a previous device
removal ignore this top level vdev.  Failure to do so will result in
the current configuration being assessed to have a non-uniform
replication level and the expected warning will be disabled.

The zpool_add_010_pos test case was extended to cover this scenario.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6907
Closes #6911
2017-12-04 17:21:39 -08:00
LOLi
6db8f1a0d1 Fix 'zfs get {user|group}objused@' functionality
Fix a regression accidentally introduced in 1b81ab4 that prevents
'zfs get {user|group}objused@' from correctly reporting the requested
value.

Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify
this functionality.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6908
2017-12-04 17:21:39 -08:00
Brian Behlendorf
954516cec1 Emit history events for 'zpool create'
History commands and events were being suppressed for the
'zpool create' command since the history object did not
yet exist.  Create the object earlier so this history
doesn't get lost.

Split the pool_destroy event in to pool_destroy and
pool_export so they may be distinguished.

Updated events_001_pos and events_002_pos test cases.  They
now check for the expected history events and were reworked
to be more reliable.

Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6712
Closes #6486
Conflicts:
	tests/zfs-tests/tests/functional/events/events_002_pos.ksh
2017-12-04 17:21:03 -08:00
LOLi
fedc1d96a8 Fix truncate(2) mtime and ctime handling
On Linux, ftruncate(2) always changes the file timestamps, even if the
file size is not changed. However, in case of a successfull
truncate(2), the timestamps are updated only if the file size changes.
This translates to the VFS calling the ZFS Posix Layer "setattr"
function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally
set on the iattr mask only when doing a ftruncate(2), while the
truncate(2) is left to the filesystem implementation to be dealt with.

This behaviour is consistent with POSIX:2004/SUSv3 specifications
where there's no explicit requirement for file size changes to update
the timestamps only for ftruncate(2):

http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html

This has been later updated in POSIX:2008/SUSv4 where, for both
truncate(2)/ftruncate(2), there's no mention of this size change
requirement:

http://austingroupbugs.net/view.php?id=489
http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Unfortunately the Linux VFS is still calling into the ZPL without
ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by
explicitly updating the timestamps when detecting the ATTR_SIZE bit,
which is always set in do_truncate(), on the iattr mask.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6811
Closes #6819
2017-11-21 13:11:29 -06:00
Giuseppe Di Natale
c45254b0ec Correct flake8 errors after STYLE builder update
Fix new flake8 errors related to bare excepts and ambiguous
variable names due to a STYLE builder update.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6776
2017-11-20 16:19:23 -06:00
Brian Behlendorf
6e893ef62a Fix chattr/cleanup failure
The chattr cleanup step may fail to delete the user if there is still
an active process running as that user.  Retry the userdel when this
occurs to eliminate spurious false positves.

  ERROR: userdel quser1 exited 8
  userdel: user quser1 is currently used by process 26814

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6749
2017-10-17 16:48:58 -07:00
LOLi
926c6ec453 Fix intra-pool resumable 'zfs send -t <token>'
Because resuming from a token requires "guid" -> "snapshot" mapping
we have to walk the whole dataset hierarchy to find the right snapshot
to send; when both source and destination exists, for an incremental
resumable stream, libzfs gets confused and picks up the wrong snapshot
to send from: this results in attempting to send

   "destination@snap1 -> source@snap2"

instead of

   "source@snap1 -> source@snap2"

which fails with a "Invalid cross-device link" error (EXDEV).

Fix this by adjusting the logic behind dataset traversal in
zfs_iter_children() to pick the right snapshot to send from.

Additionally update dry-run 'zfs send -t' to print its output to
stderr: this is consistent with other dry-run commands.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6618
Closes #6619
Closes #6623
2017-10-16 10:57:55 -07:00
Brian Behlendorf
91b2f6ab1c Fix ARC behavior on 32-bit systems
With the addition of the ABD changes consumption of the virtual
address space has been greatly reduced.  This exposed an issue on
CONFIG_HIGHMEM systems where free memory was being calculated
incorrectly.  Functionally this didn't cause any major problems
prior to ABD because a lack of available virtual address space
was used as an indicator of low memory.

This patch makes the following changes to address the issue and
in the process realigns the code further with OpenZFS.  There
are no substantive changes in behavior for 64-bit systems.

* Added CONFIG_HIGHMEM case to the arc_all_memory() and
  arc_free_memory() functions to only consider low memory pages
  on CONFIG_HIGHMEM systems.

* The arc_free_memory() function was updated to return bytes
  instead of pages to be consistent with the other helper
  functions.  In user space we make up some reasonable values
  since currently only testing is performed in this context.

* Adds three new values to the arcstats kstat to provide visibility
  in to the ARC's assessment of the memory situation:
  memory_all_bytes, memory_free_bytes, and memory_available_bytes.

* Added kmem_reap() call to arc_available_memory() for 32-bit
  builds to realign code with OpenZFS.

* Reduced size of test file in /async_destroy_001_pos.ksh to
  speed up test case.  Multiple txgs are still required.

* Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos
  to TEST_BASE_DIR location to speed up test cases.

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5352
Closes #6734
2017-10-16 10:57:55 -07:00
Ned Bass
4cfc086e4d receive_freeobjects() skips freeing some objects
When receiving a FREEOBJECTS record, receive_freeobjects()
incorrectly skips a freed object in some cases. Specifically, this
happens when the first object in the range to be freed doesn't exist,
but the second object does. This leaves an object allocated on disk
on the receiving side which is unallocated on the sending side, which
may cause receiving subsequent incremental streams to fail.

The bug was caused by an incorrect increment of the object index
variable when current object being freed doesn't exist.  The
increment is incorrect because incrementing the object index is
handled by a call to dmu_object_next() in the increment portion of
the for loop statement.

Add test case that exposes this bug.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6694
Closes #6695
2017-10-16 10:57:55 -07:00
Brian Behlendorf
4e6a9e4598 ZTS fix slog_replay_volume.ksh failure
The slog_replay_volume.ksh test case will fail when the pool is
layered on files in a filesystem which does not support discard.
Avoid this issue by creating the pool using DISKS which will
either be loopback device or real disk.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6654
2017-09-20 10:25:54 -07:00
Gaurav Kumar
d3e7d981d4 Modifying XATTRs doesnt change the ctime
Changing any metadata, should modify the ctime.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #3644
Closes #6586
2017-09-13 16:05:18 -07:00
Brian Behlendorf
a2a0440918 Fix volume WR_INDIRECT log replay (#6620)
The portion of the zvol_replay_write() handler responsible for
replaying indirect log records for some reason never existed.
As a result indirect log records were not being correctly replayed.

This went largely unnoticed since the majority of zvol log records
were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578.

This patch updates zvol_replay_write() to correctly handle these
log records and adds a new test case which verifies volume replay
to prevent any regression.  The existing test case which verified
replay on filesystem was renamed slog_replay_fs.ksh for clarity.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6603
2017-09-13 16:04:16 -07:00
Giuseppe Di Natale
45d1abc74d Improved dnode allocation and dmu_hold_impl() (#6611)
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-09-13 15:46:15 -07:00
LOLi
ae5b4a05ff Fix range locking in ZIL commit codepath
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().

Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.

Fix this by taking a range lock on the whole block in zvol_get_data().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6238 
Closes #6315 
Closes #6356 
Closes #6477
2017-08-21 16:46:54 -07:00
LOLi
3468fdbd34 Fix remounting snapshots read-write
It's not enough to preserve/restore MS_RDONLY on the superblock flags
to avoid remounting a snapshot read-write: be explicit about our
intentions to the VFS layer so the readonly bit is updated correctly
in do_remount_sb().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6510
Closes #6515
2017-08-21 16:46:52 -07:00
Brian Behlendorf
fb3f1fdbd6 Fix ZTS grow_pool/setup
The addition of the large_dnode_008_pos test case, which runs
right before this one, exposed some racy behavior in grow_pool
setup.sh on the Ubuntu kmemleak builder.  Before creating
partitions on a device destroying any existing ones.

  ERROR: set_partition 1  100mb loop0 exited 1

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6499 
Closes #6516
2017-08-21 16:41:22 -07:00
Brian Behlendorf
751941e248 Fix dnode allocation race
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occurs for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only latter when
  dnode_allocate() is called will it transition to the new
  type.  This means there's a window when allocating where
  it can mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414
Closes #6439
2017-08-08 10:17:33 -07:00
Giuseppe Di Natale
12acabe2a4 mmp_on_uberblocks: Use kstat for uberblock counts
Use kstat to get a more accurate count of uberblock updates.
Using a loop with zdb can potentially miss some uberblocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6407
Closes #6419
2017-08-02 11:21:33 -07:00
LOLi
20c88dc3ef Fix volmode=none property behavior at import time
At import time spa_import() calls zvol_create_minors() directly: with
the current implementation we have no way to avoid device node
creation when volmode=none.

Fix this by enforcing volmode=none directly in zvol_alloc().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6426
2017-08-02 11:21:14 -07:00
Giuseppe Di Natale
affb7141d7 Disable zfs_send_007_pos
Test case zfs_send_007_pos regularly is killed
by test-runner during zfs-tests on buildbot. Disable
it for now until further investigation can be done.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6422
2017-08-02 11:20:32 -07:00
Olaf Faaland
e889f0f520 Report MMP_STATE_NO_HOSTID immediately
There is no need to perform the activity check before detecting that the
user must set the system hostid, because the pool's multihost property
is on, but spa_get_hostid() returned 0.  The initial call to
vdev_uberblock_load() provided the information required.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6388
2017-07-25 13:22:28 -04:00
Olaf Faaland
0582e40322 Add callback for zfs_multihost_interval
Add a callback to wake all running mmp threads when
zfs_multihost_interval is changed.

This is necessary when the interval is changed from a very large value
to a significantly lower one, while pools are imported that have the
multihost property enabled.

Without this commit, the mmp thread does not wake up and detect the new
interval until after it has waited the old multihost interval time.  A
user monitoring mmp writes via the provided kstat would be led to
believe that the changed setting did not work.

Added a test in the ZTS under mmp to verify the new functionality is
working.

Added a test to ztest which starts and stops mmp threads, and calls into
the code to signal sleeping mmp threads, to test for deadlocks or
similar locking issues.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6387
2017-07-25 13:22:20 -04:00
Olaf Faaland
b9373170e3 Add zgenhostid utility script
Turning the multihost property on requires that a hostid be set to allow
ZFS to determine when a foreign system is attemping to import a pool.
The error message instructing the user to set a hostid refers to
genhostid(1).

Genhostid(1) is not available on SUSE Linux.  This commit adds a script
modeled after genhostid(1) for those users.

Zgenhostid checks for an /etc/hostid file; if it does not exist, it
creates one and stores a value.  If the user has provided a hostid as an
argument, that value is used.  Otherwise, a random hostid is generated
and stored.

This differs from the CENTOS 6/7 versions of genhostid, which overwrite
the /etc/hostid file even though their manpages state otherwise.

A man page for zgenhostid is added. The one for genhostid is in (1), but
I put zgenhostid in (8) because I believe it's more appropriate.

The mmp tests are modified to use zgenhostid to set the hostid instead
of using the spl_hostid module parameter.  zgenhostid will not replace
an existing /etc/hostid file, so new mmp_clear_hostid calls are
required.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6358
Closes #6379
2017-07-25 13:22:03 -04:00
Olaf Faaland
ffb195c256 Release SCL_STATE in map_write_done()
The config lock must be held for the duration of the MMP write.
Since the I/Os are executed via map_nowait(), the done function
is the only place where we know the write has completed.

Since SCL_STATE is taken as reader, overlapping I/Os do not
create a deadlock.  The refcount is simply increased when new
I/Os are queued and decreased when I/Os complete.

Test case added which exercises the probe IO call path to
verify the fix and prevent a regression.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6394
2017-07-25 12:25:05 -04:00
Giuseppe Di Natale
f6837d9b53 Increase delay for zed log in events tests
In zed event test cases, a brief delay was introduced
to allow for events to make it to the zed log. On at least
one buildbot builder, the 1 second delay is not long enough.
Therefore, increasing the delay should ensure the zed has
more than enough time to write to its log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6395
2017-07-24 13:02:42 -07:00
Giuseppe Di Natale
39554216df zfs_mount_001_neg: use log_must_busy in cleanup
Use log_must_busy when destroying the snapshot
and dataset during cleanup in zfs_mount_001_neg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6382
2017-07-24 11:10:25 -07:00
Giuseppe Di Natale
0c656a964d Disable nbmand tests on kernels w/o support
This change allows mountpoint_003_pos and send-c_props
to run on Linux kernels that do not support mandatory
locking. Linux kernel versions greater than or equal to
4.4 no longer support mandatory locking and the test
suite will now account for that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6346 
Closes #6347 
Closes #6362
2017-07-24 11:03:50 -07:00
Tony Hutter
c89a02a26a Add new fsck return code to zvol_misc_002_pos
zvol_misc_002_pos was failing on Fedora 26 because its newer version
of fsck was returning a different code than previous versions.  The
new fsck error code is valid and is been added to the test in this
patch.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6350
2017-07-24 10:58:14 -07:00
Olaf Faaland
379ca9cf2b Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP.  When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp.  Property defaults to off.

During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock.  Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".

Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval.  The period is specified in milliseconds.  The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.

Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path.  Abbreviated
output below.

$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg   timestamp  mmp_delay   vdev_guid   vdev_label vdev_path
20468    261337  250274925   68396651780       3    /dev/sda
20468    261339  252023374   6267402363293     1    /dev/sdc
20468    261340  252000858   6698080955233     1    /dev/sdx
20468    261341  251980635   783892869810      2    /dev/sdy
20468    261342  253385953   8923255792467     3    /dev/sdd
20468    261344  253336622   042125143176      0    /dev/sdab
20468    261345  253310522   1200778101278     2    /dev/sde
20468    261346  253286429   0950576198362     2    /dev/sdt
20468    261347  253261545   96209817917       3    /dev/sds
20468    261349  253238188   8555725937673     3    /dev/sdb

Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.

When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test.  For example, the
pool is exported to run zdb and then imported again.  Add a new ztest
function, "-M", to alter ztest behavior to prevent this.

Add new tests to verify the new functionality.  Tests provided by
Giuseppe Di Natale.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-13 13:54:00 -04:00
LOLi
cf8738d853 Add port of FreeBSD 'volmode' property
The volmode property may be set to control the visibility of ZVOL
block devices.

This allow switching ZVOL between three modes:
   full - existing fully functional behaviour (default)
   dev  - hide partitions on ZVOL block devices
   none - not exposing volumes outside ZFS

Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.

This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.

Original-patch-by: mav <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796 
Closes #3438 
Closes #6233
2017-07-12 13:05:37 -07:00