mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-14 15:34:53 +03:00

Author	SHA1	Message	Date
Roman Strashkin	234234ca4d	Panic when running 'zpool split' Added missing remove of detachable VDEV from txg's DTL list to avoid use-after-free for the split VDEV Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Closes #5565 Closes #7856	2019-03-22 13:11:36 -07:00
George Wilson	2efea7c82c	ZFS Reads may result in unneccesary calls to zil_commit ZFS supports O_RSYNC for read operations and when specified will ensure the same level of data integrity that O_DSYNC and O_SYNC provides for writes. O_RSYNC by itself has no effect so it must be combined with either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect and causes unnecessary calls to zil_commit. Only platforms which support O_RSYNC should implement the zil_commit functionality in the read code path. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8523	2019-03-22 13:09:11 -07:00
Olaf Faaland	060f0226e6	MMP interval and fail_intervals in uberblock When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7842	2019-03-21 12:47:57 -07:00
Jorgen Lundman	d10b2f1d35	Mutex leak in dsl_dataset_hold_obj() In addition to dsl_dataset_evict_async() releasing a hold, there is an error case in dsl_dataset_hold_obj() which had missed 4 additional release calls. This was introduced in `a1d477c24`. openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c Authored by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8517	2019-03-21 10:36:58 -07:00
cfzhu	45001b949c	QAT: Allocate digest_buffer using QAT_PHYS_CONTIG_ALLOC() If the buffer 'digest_buffer' is allocated in the qat_checksum() stack, it can't ensure that the address is physically contiguous, and the DMA result of the buffer may be handled incorrectly. Using QAT_PHYS_CONTIG_ALLOC() ensures a physically contiguous allocation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com> Closes #8323 Closes #8521	2019-03-21 10:35:18 -07:00
Brian Behlendorf	ec4f9b8f30	Report holes when there are only metadata changes Update the dirty check in dmu_offset_next() such that dnode's are only considered dirty for the purpose or reporting holes when there are pending data blocks or frees to be synced. This ensures that when there are only metadata updates to be synced (atime) that holes are reported. Reviewed-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6958 Closes #8505	2019-03-21 10:30:15 -07:00
Julian Heuking	304d469dcd	Add missing dmu_zfetch_fini() in dnode_move_impl() As it turns out, on the Windows platform when rw_init() is called (rather its bedrock call ExInitializeResourceLite) it is placed on an active-list of locks, and is removed at rw_destroy() time. dnode_move() has logic to copy over the old-dnode to new-dnode, including calling dmu_zfetch_init(new-dnode). But due to the missing dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the memory (and in debug builds fill pattern 0xdeadbeef) over the Windows active-lock's prev/next list pointers, making Windows sad. But on other platforms, the contents of dmu_zfetch_fini() is one call to list_destroy() and one to rw_destroy(), which is effectively a no-op call and is not required. This commit is mostly for "correctness" and can be skipped there. Porting Notes: * This leak exists on Linux but currently can never happen because the dnode_move() functionality is not supported. openzfsonosx-commit: openzfsonosx/zfs@d95fe517 Authored by: Julian Heuking <JulianH@beckhoff.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #8519	2019-03-20 15:06:55 -07:00
Brian Behlendorf	ca6c7a94c9	Fix l2arc_evict() destroy race When destroying an arc_buf_hdr_t its identity cannot be discarded until it is entirely undiscoverable. This not only includes being unhashed, but also being removed from the l2arc header list. Discarding the header's identify prematurely renders the hash lock useless because it will always hash to bucket zero. This change resolves a race with l2arc_evict() by discarding the identity after it has been removed from the l2arc header list. This ensures either the header is not on the list or contains the correct identify. Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7688 Closes #8144	2019-03-15 14:17:38 -07:00
Tom Caputi	ab7615d92c	Multiple DVA Scrubbing Fix Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8453	2019-03-15 14:14:31 -07:00
Tony Hutter	2bbec1c910	Make zpool status counters match error events count The number of IO and checksum events should match the number of errors seen in zpool status. Previously there was a mismatch between the two counts because zpool status would only count unrecovered errors, while zpool events would get an event for all errors (recovered or not). This lead to situations where disks could be faulted for "too many errors", while at the same time showing zero errors in zpool status. This fixes the zpool status error counters to increment at the same times we post the error events. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4851 Closes #7817	2019-03-14 18:21:53 -07:00
Tom Caputi	04a3b0796c	Fix memory leaks in zfsvfs_create_impl() This patch simply fixes some small memory leaks that can happen during error handling in zfsvfs_create_impl(). If the function fails, it frees all the memory / references it created. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8490	2019-03-14 18:14:36 -07:00
Tom Caputi	eaed840542	Better user experience for errata 4 This patch attempts to address some user concerns that have arisen since errata 4 was introduced. * The errata warning has been made less scary for users without any encrypted datasets. * The errata warning now clears itself without a pool reimport if the bookmark_v2 feature is enabled and no encrypted datasets exist. * It is no longer possible to create new encrypted datasets without enabling the bookmark_v2 feature, thus helping to ensure that the errata is resolved. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue ##8308 Closes #8504	2019-03-14 16:48:30 -07:00
Alexander Motin	1af240f3b5	Add separate aggregation limit for non-rotating media Before sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, supposedly to reduce number of head seeks for spinning disks. But for SSDs it makes little to no sense, especially on FreeBSD, where due to MAXPHYS limitation device will likely still see bunch of 128KB I/Os instead of one large. Having more strict aggregation limit for SSDs allows to avoid allocation of large memory buffer and copy to/from it, that is a serious problem when throughput reaches gigabytes per second. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8494	2019-03-13 12:00:10 -07:00
Tom Caputi	f00ab3f22c	Detect and prevent mixed raw and non-raw sends Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8308	2019-03-13 11:00:43 -07:00
Tom Caputi	579ce7c5ae	Add bookmark v2 on-disk feature This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix that for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8308	2019-03-13 10:58:39 -07:00
Tom Caputi	369aa501d1	Fix handling of maxblkid for raw sends Currently, the receive code can create an unreadable dataset from a correct raw send stream. This is because it is currently impossible to set maxblkid to a lower value without freeing the associated object. This means truncating files on the send side to a non-0 size could result in corruption. This patch solves this issue by adding a new 'force' flag to dnode_new_blkid() which will allow the raw receive code to force the DMU to accept the provided maxblkid even if it is a lower value than the existing one. For testing purposes the send_encrypted_files.ksh test has been extended to include a variety of truncated files and multiple snapshots. It also now leverages the xattrtest command to help ensure raw receives correctly handle xattrs. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8168 Closes #8487	2019-03-13 10:52:01 -07:00
Justin Gottula	cffa8372f4	Fix most zfs_arc_* mod params not actually being modifiable at runtime Most of the zfs_arc_* module parameters do not have their values used by the ARC code directly. Instead, there is a function, arc_tuning_update, which is called during module initialization and periodically thereafter, whose job is to fetch the module parameter values, clamp/ limit them appropriately, and then assign those values to a separate set of internal variables that are actually referenced by the ARC code. Commit `3ec34e55` featured an overhaul of arc_reclaim_thread, which is the former location where the post-init-time calls to arc_tuning_update would occur. The rework split the work previously done by the arc_reclaim_thread into a pair of replacement threads; and unfortunately, the call to arc_tuning_update fell through the cracks and was lost in the reorganization. This meant that changing almost any ARC-related zfs module parameter via /sys/module/zfs/parameters/ would result in the module parameter value itself appearing to change; however the modification would not actually propagate to the ARC code and have any real effect. This commit reinstates the post-init-time call to arc_tuning_update. It is now called during arc_adjust_cb_check; this should be equivalent to its former call location in arc_reclaim_thread. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #8405 Closes #8463	2019-03-12 15:03:59 -07:00
Alek P	4c0883fb4a	Avoid retrieving unused snapshot props This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8077	2019-03-12 13:13:22 -07:00
Brian Behlendorf	dd785b5b86	Fix vdev_initialize_restart / removal race Resolve a vdev_initialize crash uncovered by ztest. Similar to when starting a new initialization verify that a removal is not in progress. Additionally, do not restart when the thread already exists. This check is now congruent with the POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl(). Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8477	2019-03-12 10:39:47 -07:00
Olaf Faaland	3d31aad83e	MMP writes rotate over leaves Instead of choosing a leaf vdev quasi-randomly, by starting at the root vdev and randomly choosing children, rotate over leaves to issue MMP writes. This fixes an issue in a pool whose top-level vdevs have different numbers of leaves. The issue is that the frequency at which individual leaves are chosen for MMP writes is based not on the total number of leaves but based on how many siblings the leaves have. For example, in a pool like this: root-vdev +------+---------------+ vdev1 vdev2 \| \| \| +------+-----+-----+----+ disk1 disk2 disk3 disk4 disk5 disk6 vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1 is chosen, disk1 will be chosen. However, every time vdev2 is chosen, disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as many MMP writes as disk2. This may create wear issues in the case of SSDs. It also reduces the effectiveness of MMP as it depends on the writes being evenly distributed for the case where some devices fail or are partitioned. The new code maintains a list of leaf vdevs in the pool. MMP records the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list, continuing from the head if the tail is reached. It stops when a suitable leaf is found or all leaves have been examined. Added a test to verify MMP write distribution is even. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7953	2019-03-12 10:37:06 -07:00
George Wilson	b1b94e9644	zfs does not honor NFS sync write semantics The linux kernel's nfsd implementation use RWF_SYNC to determine if the write is synchronous or not. This flag is used to set the kernel's I/O control block flags. Unfortunately, ZFS was not updated to inspect these flags so NFS sync writes were not being honored. This change maps the IOCB_* flags to the ZFS equivalent. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8474 Closes #8452 Closes #8486	2019-03-11 09:13:37 -07:00
mzhivich	1118f99449	Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen() Booting debug kernel found an inconsistent lock dependency between dataset's ds_lock and its directory's dd_lock. [ 32.215336] ====================================================== [ 32.221859] WARNING: possible circular locking dependency detected [ 32.221861] 4.14.90+ #8 Tainted: G O [ 32.221862] ------------------------------------------------------ [ 32.221863] dynamic_kernel_/4667 is trying to acquire lock: [ 32.221864] (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.221941] but task is already holding lock: [ 32.221941] (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] [ 32.221983] which lock already depends on the new lock. [ 32.221983] the existing dependency chain (in reverse order) is: [ 32.221984] -> #1 (&dd->dd_lock){+.+.}: [ 32.221992] __mutex_lock+0xef/0x14c0 [ 32.222049] dsl_dir_namelen+0xd4/0x2d0 [zfs] [ 32.222093] dsl_dataset_namelen+0x2f1/0x430 [zfs] [ 32.222142] verify_dataset_name_len+0xd/0x40 [zfs] [ 32.222184] dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs] [ 32.222226] dmu_objset_find_dp_cb+0x40/0x60 [zfs] [ 32.222235] taskq_thread+0x969/0x1460 [spl] [ 32.222238] kthread+0x2fb/0x400 [ 32.222241] ret_from_fork+0x3a/0x50 [ 32.222241] -> #0 (&ds->ds_lock){+.+.}: [ 32.222246] lock_acquire+0x14f/0x390 [ 32.222248] __mutex_lock+0xef/0x14c0 [ 32.222291] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.222355] dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs] [ 32.222392] dmu_tx_assign+0xa61/0xdb0 [zfs] [ 32.222436] zfs_create+0x4e6/0x11d0 [zfs] [ 32.222481] zpl_create+0x194/0x340 [zfs] [ 32.222484] lookup_open+0xa86/0x16f0 [ 32.222486] path_openat+0xe56/0x2490 [ 32.222488] do_filp_open+0x17f/0x260 [ 32.222490] do_sys_open+0x195/0x310 [ 32.222491] SyS_open+0xbf/0xf0 [ 32.222494] do_syscall_64+0x191/0x4f0 [ 32.222496] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 32.222497] other info that might help us debug this: [ 32.222497] Possible unsafe locking scenario: [ 32.222498] CPU0 CPU1 [ 32.222498] ---- ---- [ 32.222499] lock(&dd->dd_lock); [ 32.222500] lock(&ds->ds_lock); [ 32.222502] lock(&dd->dd_lock); [ 32.222503] lock(&ds->ds_lock); [ 32.222504] * DEADLOCK * [ 32.222505] 3 locks held by dynamic_kernel_/4667: [ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0 [ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490 [ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by acquiring dd_lock on ds->ds_dir in dsl_dir_namelen(). However, ds->ds_dir should not be protected by ds_lock, so releasing it before call to dsl_dir_namelen() prevents the lockdep issue Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Michael Zhivich <mzhivich@akamai.com> Closes #8413	2019-03-11 09:11:04 -07:00
Paul Zuchowski	a73e8fdb93	Stack overflow in recursive bpobj_iterate_impl The function bpobj_iterate_impl overflows the stack when bpobjs are deeply nested. Rewrite the function to eliminate the recursion. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7674 Closes #7675 Closes #7908	2019-03-06 09:50:55 -08:00
Brian Behlendorf	96ebc5a1a4	Fix race in vdev_initialize_thread Before allowing new allocations to the metaslab we need to ensure that any issued initializing writes have been synced. Otherwise, it's possible for metaslab_block_alloc() to allocate a range which is about to be overwritten by an initializing IO. Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8461	2019-03-06 09:17:53 -08:00
Olaf Faaland	8133679ff0	Do not resume a pool if multihost is enabled When multihost is enabled, and a pool is suspended, return EINVAL in response to "zpool clear <pool>". The pool may have been imported on another host while I/O was suspended. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6933 Closes #8460	2019-02-28 17:56:19 -08:00
Matthew Ahrens	87c25d567f	abd_alloc should use scatter for >1K allocations abd_alloc() normally does scatter allocations, thus solving the problem that ABD originally set out to: the bulk of ZFS's allocations are single pages, which are faster to allocate and free, and don't suffer from internal fragmentation (and the inability to reclaim memory because some buffers in the slab are still allocated). However, the current code does linear allocations for 4KB and smaller allocations, defeating the purpose of ABD. Scatter ABD's use at least one page each, so sub-page allocations waste some space when allocated as scatter (e.g. 2KB scatter allocation wastes half of each page). Using linear ABD's for small allocations means that they will be put on slabs which contain many allocations. This can improve memory efficiency, but it also makes it much harder for ARC evictions to actually free pages, because all the buffers on one slab need to be freed in order for the slab (and underlying pages) to be freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's possible for them to actually waste more memory than scatter (one page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting 15/16th). Spill blocks are typically 512B and are heavily used on systems running selinux with the default dnode size and the `xattr=sa` property set. By default we will use linear allocations for 512B and 1KB, and scatter allocations for larger (1.5KB and up). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8455	2019-02-28 17:52:55 -08:00
Brian Behlendorf	6af7ba417e	Fix overly broad spa config lock The spa_txg_history_init_io() and spa_txg_history_fini_io() were mistakenly taking SCL_ALL when only SCL_CONFIG is required to access the vdev stats. This could result in a deadlock which was observed when running ztest. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8445	2019-02-27 10:49:22 -08:00
Serapheim Dimitropoulos	8eef997679	Error path in metaslab_load_impl() forgets to drop ms_sync_lock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8444	2019-02-25 11:08:52 -08:00
loli10K	c44a3ec059	zvol: allow rename of in use ZVOL dataset While ZFS allow renaming of in use ZVOLs at the DSL level without issues the ZVOL layer does not correctly update the renamed dataset if the device node is open (zv->zv_open_count > 0): trying to access the stale dataset name, for instance during a zfs receive, will cause the following failure: VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00) PANIC at zvol.c:1255:zvol_resume() Showing stack for process 1390 CPU: 0 PID: 1390 Comm: zfs Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62 Call Trace: [<0>] ? dump_stack+0x5d/0x78 [<0>] ? spl_panic+0xc9/0x110 [spl] [<0>] ? mutex_lock+0xe/0x2a [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs] [<0>] ? rrw_exit+0xc8/0x2e0 [zfs] [<0>] ? mutex_lock+0xe/0x2a [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs] [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs] [<0>] ? zvol_resume+0x178/0x280 [zfs] [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs] [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs] [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs] [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs] [<0>] ? __alloc_pages_nodemask+0x166/0xb50 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs] [<0>] ? handle_mm_fault+0x464/0x1140 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0 [<0>] ? __do_page_fault+0x177/0x410 [<0>] ? SyS_ioctl+0x81/0xa0 [<0>] ? async_page_fault+0x28/0x30 [<0>] ? system_call_fast_compare_end+0x10/0x15 Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6263 Closes #8371	2019-02-22 15:38:42 -08:00
loli10K	0c637f3100	zpool reports 16E expandsize on disks with oddball number of sectors The issue is caused by a small discrepancy in how userland creates the partition layout and the kernel estimates available space: * zpool command: subtract 9M from the usable device size, then align to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M EFI "reserved" partition. * kernel module: subtract 10M from the device size. 10M is the sum of 1M "start" partition alignment + 1m "end" partition alignment + 8M EFI "reserved" partition. For devices where the number of sectors is not a multiple of the alignment size the zpool command will create a partition layout which reserves less than 1M after the 8M EFI "reserved" partition: Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: gpt Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F Device Start End Sectors Size Type /dev/sda1 2048 2078719 2076672 1014M Solaris /usr & Apple ZFS /dev/sda9 2078720 2095103 16384 8M Solaris reserved 1 When the kernel module vdev_open() the device its max_asize ends up being slightly smaller than asize: this results in a huge number (16E) reported by metaslab_class_expandable_space(). This change prevents bdev_max_capacity() from returing a size smaller than bdev_capacity(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1468 Closes #8391	2019-02-22 15:36:34 -08:00
lidongyang	8d9e51c084	Fix dnode_hold_impl() soft lockup Soft lockups could happen when multiple threads trying to get zrl on the same dnode handle in order to allocate and initialize the dnode marked as DN_SLOT_ALLOCATED. Don't loop from beginning when we can't get zrl, otherwise we would increase the zrl refcount and nobody can actually lock it. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyangli@ddn.com> Closes #8433	2019-02-22 09:48:37 -08:00
Tomohiro Kusumi	9abbee4912	Don't enter zvol's rangelock for read bio with size 0 The SCST driver (SCSI target driver implementation) and possibly others may issue read bio's with a length of zero bytes. Although this is unusual, such bio's issued under certain condition can cause kernel oops, due to how rangelock is implemented. rangelock_add_reader() is not made to handle overlap of two (or more) ranges from read bio's with the same offset when one of them has size of 0, even though they conceptually overlap. Allowing them to enter rangelock results in kernel oops by dereferencing invalid pointer, or assertion failure on AVL tree manipulation with debug enabled kernel module. For example, this happens when read bio whose (offset, size) is (0, 0) enters rangelock followed by another read bio with (0, 4096) when (0, 0) rangelock is still locked, when there are no pending write bio's. It can also happen with reverse order, which is (0, N) followed by (0, 0) when (0, N) is still locked. More details mentioned in #8379. Kernel Oops on ->make_request_fn() of ZFS volume https://github.com/zfsonlinux/zfs/issues/8379 Prevent this by returning bio with size 0 as success without entering rangelock. This has been done for write bio after checking flusher bio case (though not for the same reason), but not for read bio. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8379 Closes #8401	2019-02-20 10:14:36 -08:00
Serapheim Dimitropoulos	928e8ad47d	Introduce auxiliary metaslab histograms This patch introduces 3 new histograms per metaslab. These histograms track segments that have made it to the metaslab's space map histogram (and are part of the spacemap) but have not yet reached the ms_allocatable tree on loaded metaslab's because these metaslab's are currently syncing and haven't gone through metaslab_sync_done() yet. The histograms help when we decide whether to load an unloaded metaslab in-order to allocate from it. When calculating the weight of an unloaded metaslab traditionally, we look at the highest bucket of its spacemap's histogram. The problem is that we are not guaranteed to be able to allocated that segment when we load the metaslab because it may still be at the freeing, freed, or defer trees. The new histograms are used when we try to calculate an unloaded metaslab's weight to deal with this issue by removing segments that have would not be in the allocatable tree at runtime. Note, that this method of dealing with this is not completely accurate as adjacent segments are not always consolidated in the space map histogram of a metaslab. In addition and to make things deterministic, we always reset the weight of unloaded metaslabs based on their space map weight (instead of doing that on a need basis). Thus, every time a metaslab is loaded and its weight is reset again (from the weight based on its space map to the one based on its allocatable range tree) we expect (and assert) that this change in weight can only get better if it doesn't stay the same. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8358	2019-02-20 09:59:56 -08:00
loli10K	bb1be77a35	Prevent user accounting on readonly pool Trying to mount a dataset from a readonly pool could inadvertently start the user accounting upgrade task, leading to the following failure: VERIFY3(tx->tx_threads == 2) failed (0 == 2) PANIC at txg.c:680:txg_wait_synced() Showing stack for process 2541 CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: [<0>] ? dump_stack+0x5d/0x78 [<0>] ? spl_panic+0xc9/0x110 [spl] [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs] [<0>] ? dmu_object_next+0x77/0x130 [zfs] [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs] [<0>] ? txg_wait_synced+0x91/0x220 [zfs] [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs] [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs] [<0>] ? taskq_thread+0x2cc/0x5d0 [spl] [<0>] ? wake_up_state+0x10/0x10 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl] [<0>] ? kthread+0xbd/0xe0 [<0>] ? kthread_create_on_node+0x180/0x180 [<0>] ? ret_from_fork+0x58/0x90 [<0>] ? kthread_create_on_node+0x180/0x180 This patch updates both functions responsible for checking if we can perform user accounting to verify the pool is not readonly. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8424	2019-02-19 18:41:18 -08:00
Sara Hartse	f545b6ae00	Delay injection can cause indefinitely hung zios If we hit the (NSEC_TO_TICK(diff) == 0) condition in zio_delay_interrupt, zio_interrupt is never called and the zio does not progress. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: sara hartse <sara.hartse@delphix.com> Closes #8404	2019-02-15 14:44:56 -08:00
Tim Chase	638dd5f44e	zio_deadman_impl() fix and enhancement Add the zio_deadman_log_all tunable to print all zios in zio_deadman_impl(). Also, in all cases, display the depth of the zio relative to the original parent zio. This is meant to be used by developers to gain diagnostic information for hangs which don't involve fully set-up zio trees or are otherwise stuck or hung in an early stage. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8362	2019-02-15 12:44:24 -08:00
Paul Zuchowski	9c5e88b1de	zfs should optionally send holds Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7513	2019-02-15 12:41:38 -08:00
Tomohiro Kusumi	2d76ab9e42	Fix obsolete comment on rangelock `5d43cc9a59` renamed it to rangelock_enter(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8408	2019-02-14 15:13:58 -08:00
Alek P	65282ee9e0	Freeing throttle should account for holes Deletion throttle currently does not account for holes in a file. This means that it can activate when it shouldn't. To fix it we switch the throttle to be based on the number of L1 blocks we will have to dirty when freeing Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7725 Closes #7888	2019-02-12 12:01:08 -08:00
Alek P	dcec0a12c8	port async unlinked drain from illumos-nexenta This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We don't have to hold mounting of the dataset until the unlinked list is fully drained as is done now. Since we can process the unlinked set asynchronously this results in a better user experience when mounting a dataset with entries in the unlinked set. Reviewed by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8142	2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos	425d3237ee	Get rid of space_map_update() for ms_synced_length Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code furthermore, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for zpool checkpoint or vdev removal features). This patch refactors the space map code to further split the space map code from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length) which is something that is only used for the metaslab code, and other consumers of space maps just have to deal with. Instead, this patch introduces changes that move the old in-core length of the metaslab's space map to the metaslab structure itself (see ms_synced_length field) while making the space map code only care about the actual space map's length on-disk. The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync() that may be appending to that same space map. As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8328	2019-02-12 10:38:11 -08:00
loli10K	d8d418ff0c	ZVOLs should not be allowed to have children zfs create, receive and rename can bypass this hierarchy rule. Update both userland and kernel module to prevent this issue and use pyzfs unit tests to exercise the ioctls directly. Note: this commit slightly changes zfs_ioc_create() ABI. This allow to differentiate a generic error (EINVAL) from the specific case where we tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>	2019-02-08 15:44:15 -08:00
loli10K	4417096956	Pool allocation classes misplacing small file blocks Due to an off-by-one condition in spa_preferred_class() we are picking the "normal" allocation class instead of the "special" one for file blocks with size equal to the special_small_blocks property value. This change fix the small code issue, update the ZFS Test Suite and the zfs(8) man page. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8351 Closes #8361	2019-02-08 12:32:12 -08:00
Tim Chase	0902c4577f	Fix ARC stats for embedded blkptrs Re-factor arc_read() to better account for embedded data blkptrs. Previously, reading the payload from an embedded blkptr would cause arcstats such as demand_metadata_misses to be bumped when there was actually no cache "miss" because the data are already available in the blkptr. The following test procedure was used to demonstrate the problem: zpool create tank ... zfs create -o compression=lz4 tank/fs echo blah > /tank/fs/blah stat /tank/fs/blah grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats and repeating the last two steps to watch the metadata miss counter increment. This can also be demonstrated via the zfs_arc_miss DTRACE4 probe in arc_read(). Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8319	2019-02-04 09:33:30 -08:00
Serapheim Dimitropoulos	6c926f426a	Simplify log vdev removal code Get rid of the majority metaslab metadata when removing log vdevs in spa_vdev_remove_log() with a call to metaslab_fini() instead of duplicating a lot of that in vdev_remove_empty_log(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8347	2019-01-31 09:17:52 -08:00
Serapheim Dimitropoulos	7558997d2f	vs_alloc can underflow in L2ARC vdevs The current L2 ARC device code consistently uses psize to increment vs_alloc but varies between psize and lsize when decrementing it. The result of this behavior is that vs_alloc can be decremented more that it is incremented and underflow. This patch changes the code so asize is used anywhere. In addition, it ensures that vs_alloc gets incremented by the L2 ARC device code as buffers are written and not at the end of the l2arc_write_buffers() routine. The latter (and old) way would temporarily underflow vs_alloc as buffers that were just written, would be destroyed while l2arc_write_buffers() was still looping. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8298	2019-01-31 09:16:39 -08:00
Sara Hartse	2747f599ff	Don't acquire zthr_request_lock in zthr_wakeup Address a deadlock caused by simultaneous wakeup and cancel on a zthr by remove the hold of zthr_request_lock from zthr_wakeup. This allows thr_wakeup to not block a thread that is in the process of being cancelled. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Closes #8333	2019-01-30 12:31:16 -08:00
Brian Behlendorf	26a856594f	Linux 5.0 compat: Fix bio_set_dev() The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the GPL-only bio_associate_blkg() symbol thus inadvertently converting the entire macro. Provide a minimal version which always assigns the request queue's root_blkg to the bio. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8287	2019-01-28 10:11:45 -08:00
Tony Hutter	05805494dd	Linux 5.0 compat: Convert MS_* macros to SB_* In the 5.0 kernel, only the mount namespace code should use the MS_* macos. Filesystems should use the SB_* ones. https://patchwork.kernel.org/patch/10552493/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8264	2019-01-28 10:11:39 -08:00
Tony Hutter	031cea17a3	Linux 5.0 compat: Use totalram_pages() totalram_pages() was converted to an atomic variable in 5.0: https://patchwork.kernel.org/patch/10652795/ Its value should now be read though the totalram_pages() helper function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8263	2019-01-28 10:11:14 -08:00
Serapheim Dimitropoulos	c853f382db	Change target size of metaslabs from 256GB to 16GB = Old behavior For vdev sizes 100GB to 50TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 256GB. For vdev's bigger than that we start increasing the number of metaslabs until we hit the 128K limit. = New Behavior For vdev sizes 100GB to 3TB we keep ~200 metaslabs per vdev and the metaslab size grows from 512MB to 16GB. For vdev's bigger than that we start increasing the number of metaslabs until we hit the 128K limit. = Reasoning The old behavior makes metaslabs grow in size when the vdev range is between 3TB (ms_size 16GB) and 32PB (ms_size 256GB). Even though keeping the number of metaslabs is good in terms of potential number of I/Os per TXG, these bigger metaslabs take longer to be loaded and after they are loaded they can take up a lot of memory because of their range trees. This change tries to put a boundary in memory and loading time for the specific range of vdev sizes. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8324	2019-01-25 16:38:27 -08:00
Serapheim Dimitropoulos	df72b8bebe	Rename range_tree_verify to range_tree_verify_not_present The range_tree_verify function looks for a segment in a range tree and panics if the segment is present on the tree. This patch gives the function a more descriptive name. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8327	2019-01-25 09:51:24 -08:00
Tim Chase	107dd2b174	Use proper tag for spa config refcounts in mmp_write_uberblock() This allows the spa config refcounts to use tracking in debug builds without triggering the "No such hold %p on refcount" panic. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8326	2019-01-25 09:50:06 -08:00
Tom Caputi	b5d693581d	Fix bad kmem_free() in zvol_rename_minors_impl() Currently, zvol_rename_minors_impl() calls kmem_asprintf() to allocate and initialize a string. This function is a thin wrapper around the kernel's kvasprintf() and does not call into the SPL's kmem tracking code when it is enabled. However, this function frees the string with the tracked kmem_free() instead of the untracked strfree(), which causes the SPL kmem tracking code to believe that the function is attempting to free memory it never allocated, triggering an ASSERT. This patch simply corrects this issue. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8307	2019-01-23 11:38:05 -08:00
loli10K	0a10863194	ztest: creates partially initialized root dataset Since `d8fdfc2` was integrated dsl_pool_create() does not call dmu_objset_create_impl() for the root dataset when running in userland (ztest): this creates a pool with a partially initialized root dataset. Trying to import and use this pool results in both zpool and zfs executables dumping core. Fix this by adopting an alternative change suggested in OpenZFS 8607 code review. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tom Caputi <tcaputi@datto.com> Original-patch-by: Robert Mustacchi <rm@joyent.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8277	2019-01-18 11:14:01 -08:00
Brian Behlendorf	ad63507135	Remove zfs_sync() panicking kernel check This check provides no real additional protection and unnecessarily introduces a dependency on the "oops_in_progress" kernel symbol. Remove the check, it there are special circumstances on other platforms which make this a requirement it can be reintroduced for all relevant call paths in a more portable comprehensive manor. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8297	2019-01-18 11:11:47 -08:00
Serapheim Dimitropoulos	b194fab0fb	Factor metaslab_load_wait() in metaslab_load() Most callers that need to operate on a loaded metaslab, always call metaslab_load_wait() before loading the metaslab just in case someone else is already doing the work. Factoring metaslab_load_wait() within metaslab_load() makes the later more robust, as callers won't have to do the load-wait check explicitly every time they need to load a metaslab. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8290	2019-01-18 11:10:32 -08:00
Tom Caputi	960347d3a6	Fix 0 byte memory leak in zfs receive Currently, when a DRR_OBJECT record is read into memory in receive_read_record(), memory is allocated for the bonus buffer. However, if the object doesn't have a bonus buffer the code will still "allocate" the zero bytes, but the memory will not be passed to the processing thread for cleanup later. This causes the spl kmem tracking code to report a leak. This patch simply changes the code so that it only allocates this memory if it has a non-zero length. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8266	2019-01-18 11:06:48 -08:00
loli10K	60b0a963f5	Off-by-one in zap_leaf_array_create() Trying to set user properties with their length 1 byte shorter than the maximum size triggers an assertion failure in zap_leaf_array_create(): panic[cpu0]/thread=ffffff000a092c40: assertion failed: num_integers * integer_size < (8<<10) (0x2000 < 0x2000), file: ../../common/fs/zfs/zap_leaf.c, line: 233 ffffff000a092500 genunix:process_type+167c35 () ffffff000a0925a0 zfs:zap_leaf_array_create+1d2 () ffffff000a092650 zfs:zap_entry_create+1be () ffffff000a092720 zfs:fzap_update+ed () ffffff000a0927d0 zfs:zap_update+1a5 () ffffff000a0928d0 zfs:dsl_prop_set_sync_impl+5c6 () ffffff000a092970 zfs:dsl_props_set_sync_impl+fc () ffffff000a0929b0 zfs:dsl_props_set_sync+79 () ffffff000a0929f0 zfs:dsl_sync_task_sync+10a () ffffff000a092a80 zfs:dsl_pool_sync+3a3 () ffffff000a092b50 zfs:spa_sync+4e6 () ffffff000a092c20 zfs:txg_sync_thread+297 () ffffff000a092c30 unix:thread_start+8 () This patch simply corrects the assertion. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8278	2019-01-18 09:58:46 -08:00
Serapheim Dimitropoulos	8dc2197b7b	Simplify spa_sync by breaking it up to smaller functions The point of this refactoring is to break the high-level conceptual steps of spa_sync() to their own helper functions. In general large functions can enhance readability if structured well, but in this case the amount of conceptual steps taken could use the help of helper functions. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8293	2019-01-18 09:50:16 -08:00
Tom Caputi	305781da4b	Fix error handling incallers of dbuf_hold_level() Currently, the functions dbuf_prefetch_indirect_done() and dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot fail. In the event of an error the former will cause a NULL pointer dereference and the later will trigger a VERIFY. This patch adds error handling to these functions and their callers where necessary. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8291	2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos	75058f3303	Remove unused vdev_t fields The following fields from the vdev_t struct are not used anywhere. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8285	2019-01-17 15:41:12 -08:00
Brian Behlendorf	52b684236d	ztest: scrub ddt repair The ztest_ddt_repair() test is designed inflict damage to the ddt which can be repairable by a scrub. Unfortunately, this repair logic was broken at some point and it went undetected. This issue is not specific to ztest, but thankfully this extra redundancy is rarely enabled and even more rarely needed. The root cause was identified to be the ddt_bp_create() function called by dsl_scan_ddt_entry() which did not set the dedup bit of the generated block pointer. The consequence of this was that the ZIO_DDT_READ_PIPELINE was never enabled for the block pointer during the scrub, and the dedup ditto repair logic was never run. Note that for demand reads which don't rely on ddt_bp_create() the required pipeline stages would be enabled and the repair performed. This was resolved by unconditionally setting the dedup bit in ddt_bp_create(). This way all codes paths which may need to perform a repair from a block pointer generated from the dtt entry will be able too. The only exception is that the dedup bit is cleared in ddt_phys_free() which is required to avoid leaking space. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8270	2019-01-17 15:25:00 -08:00
Serapheim Dimitropoulos	419ba59145	Update vdev_is_spacemap_addressable() for new spacemap encoding Since the new spacemap encoding was ported to ZoL that's no longer a limitation. This patch updates vdev_is_spacemap_addressable() that was performing that check. It also updates the appropriate test to ensure that the same functionality is tested. The test does so by creating pools that don't have the new spacemap encoding enabled - just the checkpoint feature. This patch also reorganizes that same tests in order to cut in half its memory consumption. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8286	2019-01-16 15:06:20 -08:00
Brian Behlendorf	64bdf63f5c	ztest: split block reconstruction Increase the default allowed number of reconstruction attempts. There's not an exact right number for this setting. It needs to be set large enough to cover any realistic failure scenarios and small enough to avoid stalling the IO pipeline and invoking the dead man detection. The current value of 256 was empirically determined to be too low based on multi-day runs of ztest. The fault injection code would inject more damage than could be reconstructed given the relatively small number of attempts. However, in all observed cases the block could be reconstructed using a slightly higher limit. Based on local testing increasing the default value to 4096 was determined to strike the best balance. Checking all combinations takes less than 10s in the worst case, and has so far eliminated the vast majority of false positives detected by ztest. This delay is roughly on par with how long retries may be performed to a misbehaving HDD and was deemed to be reasonable. Better to err on the side of a brief delay rather than fail to reconstruct the data. Lastly, the -Y flag has been added to zdb to make it easy to try all possible combinations when performing split block reconstruction. For badly damaged blocks with 18 splits, they can be fully enumerated within a few minutes. This has been done to ensure permanent errors are never incorrectly reported when ztest verifies the pool with zdb. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8271	2019-01-16 14:10:02 -08:00
Tom Caputi	5e7f3ace58	Fix zio leak in dbuf_read() Currently, dbuf_read() may decide to create a zio_root which is used as a parent for any child zios created in dbuf_read_impl(). However, if there is an error in dbuf_read_impl(), this zio is never executed and ends up leaked. This patch simply ensures that we always execute the root zio, even i it has no real work to do. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8267	2019-01-15 12:23:40 -08:00
Brian Behlendorf	d611989fdc	Minor spelling corrections Some minor spelling mistakes and typos. No functional changes. Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8272	2019-01-13 10:11:52 -08:00
Serapheim Dimitropoulos	61c3391acc	Serialize ZTHR operations to eliminate races Adds a new lock for serializing operations on zthrs. The commit also includes some code cleanup and refactoring. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8229	2019-01-13 10:09:46 -08:00
Paul Zuchowski	83c796c5e9	zfs filesystem skipped by df -h On full pool when pool root filesystem references very few bytes, the f_blocks returned to statvfs is 0 but should be at least 1. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #8253 Closes #8254	2019-01-13 10:06:13 -08:00
Brian Behlendorf	6955b40138	Provide more flexible object allocation interface Object allocation performance can be improved for complex operations by providing an interface which returns the newly allocated dnode. This allows the caller to immediately use the dnode without incurring the expense of looking up the dnode by object number. The functions dmu_object_alloc_hold(), zap_create_hold(), and dmu_bonus_hold_by_dnode() were added for this purpose. The zap_create_* functions have been updated to take advantage of this new functionality. The dmu_bonus_hold_impl() function should really have never been included in sys/dmu.h and was removed. It's sole caller was converted to use dmu_bonus_hold_by_dnode(). The new symbols have been exported for use by Lustre. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8015	2019-01-10 14:37:43 -08:00
Tom Caputi	58769a4ebd	Don't allow dnode allocation if dn_holds != 0 This patch simply fixes a small bug where dnode_hold_impl() could attempt to allocate a dnode that was in the process of being freed, but which still had active references. This patch simply adds the required check. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8249	2019-01-10 14:36:23 -08:00
loli10K	0f5f23869a	zfs receive and rollback can skew filesystem_count This commit fixes a small issue which causes both zfs receive and rollback operations to incorrectly increase the "filesystem_count" property value. This change also adds a new test group "limits" to the ZFS Test Suite to exercise both filesystem_count/limit and snapshot_count/limit functionality. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8232	2019-01-08 10:17:46 -08:00
asomers	f384c045d8	OpenZFS 8473 - scrub does not detect errors on active spares Scrubbing is supposed to detect and repair all errors in the pool. However, it wrongly ignores active spare devices. The problem can easily be reproduced in OpenZFS at git rev 0ef125d with these commands: truncate -s 64m /tmp/a /tmp/b /tmp/c sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c sudo zpool replace testpool /tmp/a /tmp/c /bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c sync sudo zpool scrub testpool zpool status testpool # Will show 0 errors, which is wrong sudo zpool offline testpool /tmp/a sudo zpool scrub testpool zpool status testpool # Will show errors on /tmp/c, # which should've already been fixed FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more. This same test was run on Linux before applying the fix and the FreeBSD head behavior was observed. Authored by: asomers <asomers@FreeBSD.org> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored by: Spectra Logic Corp OpenZFS-issue: https://www.illumos.org/issues/8473 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/e20ec8879 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/554675ee Closes #8251	2019-01-08 09:51:30 -08:00
George Wilson	c10d37dd9f	zfs initialize performance enhancements PROBLEM ======== When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation. SOLUTION ========= This change refactors the way we invoke the initialize interface to ensure we can start or stop the intialization in just a few transaction groups. When stopping or cancelling a vdev initialization perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped. This commit additionally adds some missing checks for the passed "initialize_vdevs" input nvlist. The contents of the user provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Updated the innvl and outnvl comments to match the formatting used for all other new sytle ioctls. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8230	2019-01-07 11:03:08 -08:00
George Wilson	619f097693	OpenZFS 9102 - zfs should be able to initialize storage devices PROBLEM ======== The first access to a block incurs a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are "thick provisioned", where supported by the platform (VMware). This can create a large delay in getting a new virtual machines up and running (or adding storage to an existing Engine). If the thick provision step is omitted, write performance will be suboptimal until all blocks on the LUN have been written. SOLUTION ========= This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty. When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design: - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...] - start, suspend, or cancel initialization - Creates new open-context thread for each vdev - Thread iterates through all metaslabs in this vdev - Each metaslab: - select a metaslab - load the metaslab - mark the metaslab as being zeroed - walk all free ranges within that metaslab and translate them to ranges on the leaf vdev - issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on - continue until all free ranges for this metaslab have been "zeroed" - reset/unmark the metaslab being zeroed - if more metaslabs exist, then repeat above tasks. - if no more metaslabs, then we're done. - progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored: - the last offset that has been initialized - the state of the initialization process (i.e. active, suspended, or canceled) - the start time for the initialization - progress is reported via the zpool status command and shows information for each of the vdevs that are initializing Porting notes: - Added zfs_initialize_value module parameter to set the pattern written by "zpool initialize". - Added zfs_vdev_{initializing,removal}_{min,max}_active module options. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: loli10K <ezomori.nozomu@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Signed-off-by: Tim Chase <tim@chase2k.com> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/9102 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb Closes #8230	2019-01-07 10:37:26 -08:00
Brian Behlendorf	65ca2c1eb9	Fix 'zpool remap' freeing race The dmu_objset_remap_indirects_impl() logic depends on dnode_hold() returning ENOENT for dnodes which will be freed and should be skipped. This behavior can only be relied upon when taking a new hold and while the caller has an open transaction. This ensures that the open txg cannot advance and that a concurrent free will end up in the same txg (which is critical). Relying on an existing hold will not prevent dnode_free() from succeeding. The solution is to take an additional dnode_hold() after assigning the transaction. This ensures the remap will never dirty the dnode if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT). Randomly set zfs_object_remap_one_indirect_delay_ms in ztest. This increases the likelihood of an operation racing with the remap. Converted from ticks to milliseconds. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8215	2019-01-02 11:46:04 -08:00
Brad Lewis	3ec34e5527	OpenZFS 9284 - arc_reclaim_thread has 2 jobs Following the fix for 9018 (Replace kmem_cache_reap_now() with kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks while reaping. However, the code is still confusing and error-prone, because this thread has two responsibilities. We should instead separate this into two threads each with their own responsibility: 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()` 2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`. Furthermore, we can use the zthr infrastructure to separate the "should we do something" from "do it" parts of the logic, and normalize the start up / shut down of the threads. Authored by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Tim Kordas <tim.kordas@joyent.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Brad Lewis <brad.lewis@delphix.com> Signed-off-by: Brad Lewis <brad.lewis@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/9284 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9 Closes #8165	2018-12-26 13:22:28 -08:00
Tom Caputi	00f198de6b	Fix zfs_dirty_data_sync_percent documentation In `dfbe2675` zfs_dirty_data_sync was changed to a new tunable named zfs_dirty_data_sync_percent. Unfortunately, the module parameter documentation is the code was not updated accordingly. This patch simply corrects that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8212	2018-12-18 14:47:33 -08:00
Tom Caputi	2a6078450d	Fix zap_update() ASSERT from ztest This patch simply removes an invalid assert from the zap_update() function. The ASSERT is invalid because it does not hold the zap lock from the time it fetches the old value to the time it confirms that it is what it should be. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8209	2018-12-14 10:04:11 -08:00
Andriy Gapon	dc1c630b8a	OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_core Porting Notes: * Additional changes to recv_rename_impl() were required due to encryption code not being merged in OpenZFS yet. * libzfs_core python bindings (pyzfs) were updated to fully support both lzc_rename() and lzc_destroy() Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: loli10K <ezomori.nozomu@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/9630 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63 Closes #8207	2018-12-14 09:49:45 -08:00
Tom Caputi	5aa95ba0d3	Fix resilver writes in vdev_indirect_io_start This patch addresses an issue found in ztest where resilver write zios that were passed to an indirect vdev would end up being handled as though they were resilver read zios. This caused issues where the zio->io_abd would be both read to and written from at the same time, causing asserts to fail. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8193	2018-12-13 14:18:48 -08:00
Olaf Faaland	fa61e72340	Rename macro ZFS_MINOR due to Lustre conflict Macro ZFS_MINOR, introduced in commit `a6cc9756` to record the chosen static minor number for /dev/zfs, conflicts with an existing macro in Lustre. The lustre macro (along with _MAJOR, _PATCH, _FIX) is used to record the zfsonlinux version Lustre is being built against. Since the Lustre macro came first, and is used in past versions of lustre at least going back to 2.10, it makes sense to rename the macro in ZFS instead of doing so in Lustre which would require backporting the patch. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #8195	2018-12-10 16:52:49 -08:00
Prakash Surya	900d09b285	OpenZFS 9962 - zil_commit should omit cache thrash As a result of the changes made in 8585, it's possible for an excessive amount of vdev flush commands to be issued under some workloads. Specifically, when the workload consists of mostly async write activity, interspersed with some sync write and/or fsync activity, we can end up issuing more flush commands to the underlying storage than is actually necessary. As a result of these flush commands, the write latency and overall throughput of the pool can be poorly impacted (latency increases, throughput decreases). Currently, any time an lwb completes, the vdev(s) written to as a result of that lwb will be issued a flush command. The intenion is so the data written to that vdev is on stable storage, prior to communicating to any waiting threads that their data is safe on disk. The problem with this scheme, is that sometimes an lwb will not have any threads waiting for it to complete. This can occur when there's async activity that gets "converted" to sync requests, as a result of calling the zil_async_to_sync() function via zil_commit_impl(). When this occurs, the current code may issue many lwbs that don't have waiters associated with them, resulting in many flush commands, potentially to the same vdev(s). For example, given a pool with a single vdev, and a single fsync() call that results in 10 lwbs being written out (e.g. due to other async writes), that will result in 10 flush commands to that single vdev (a flush issued after each lwb write completes). Ideally, we'd only issue a single flush command to that vdev, after all 10 lwb writes completed. Further, and most important as it pertains to this change, since the flush commands are often very impactful to the performance of the pool's underlying storage, unnecessarily issuing these flush commands can poorly impact the performance of the lwb writes themselves. Thus, we need to avoid issuing flush commands when possible, in order to acheive the best possible performance out of the pool's underlying storage. This change attempts to address this problem by changing the ZIL's logic to only issue a vdev flush command when it detects an lwb that has a thread waiting for it to complete. When an lwb does not have threads waiting for it, the responsibility of issuing the flush command to the vdevs involved with that lwb's write is passed on to the "next" lwb. It's only once a write for an lwb with waiters completes, do we issue the vdev flush command(s). As a result, now when we issue the flush(s), we will issue them to the vdevs involved with that specific lwb's write, but potentially also to vdevs involved with "previous" lwb writes (i.e. if the previous lwbs did not have waiters associated with them). Thus, in our prior example with 10 lwbs, it's only once the last lwb completes (which will be the lwb containing the waiter for the thread that called fsync) will we issue the vdev flush command; all of the other lwbs will find they have no waiters, so they'll pass the responsibility of the flush to the "next" lwb (until reaching the last lwb that has the waiter). Porting Notes: * Reconciled conflicts with the fastwrite feature. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9962 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6 Closes #8188	2018-12-07 11:09:42 -08:00
Prakash Surya	53b1f5eac6	OpenZFS 9963 - Separate tunable for disabling ZIL vdev flush Porting Notes: * Add options to zfs-module-parameters(5) man page. * zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since the latter doesn't get built for user space. Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9963 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125 Closes #8186	2018-12-07 11:06:29 -08:00
George Wilson	18b14b17c8	OpenZFS 9993 - zil writes can get delayed in zio pipeline Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: George Melikov <mail@gmelikov.ru> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9993 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b Closes #8185	2018-12-07 11:05:35 -08:00
Tom Caputi	d6496040d9	Ensure dsl scan prefetch queue is emptied This patch simply ensures that scn->scn_prefetch_queue is emptied before the kernel module is unloaded and when scanning completes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8178	2018-12-06 09:47:23 -08:00
Brian Behlendorf	78e2139467	Fix dnode_hold() freeing dnode behavior Commit `4c5b89f59` refactored dnode_hold() and in the process accidentally introduced a slight change in behavior which was not intended. The required behavior is that once the ZPL, or other consumer, declares its intent to free a dnode then dnode_hold() should immediately start failing. This updated code wouldn't return the failure until after it was freed. When DNODE_MUST_BE_ALLOCATED is set it must return ENOENT, and when DNODE_MUST_BE_FREE is set it must return EEXIST; This issue was uncovered by ztest_remap() which attempted to remap a freeing object which should have been skipped as described by the comment in dmu_objset_remap_indirects_impl(). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8172	2018-12-05 09:29:33 -08:00
Tom Caputi	fedef6dd59	Fix ztest deadlock in spa_vdev_remove() This patch corrects an issue where spa_vdev_remove() would call spa_history_log_internal() while holding the spa config lock. This function may decide to block until the next txg if the current one seems too full. However, since the thread is holding the config log, the txg sync thread cannot progress and the system ends up deadlocked. This patch simply moves all calls to spa_history_log_internal() outside of the config lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8162	2018-12-04 09:54:05 -08:00
Brian Behlendorf	7c9a42921e	Detect IO errors during device removal * Detect IO errors during device removal While device removal cannot verify the checksums of individual blocks during device removal, it can reasonably detect hard IO errors from the leaf vdevs. Failure to perform this error checking can result in device removal completing successfully, but moving no data which will permanently corrupt the pool. Situation 1: faulted/degraded vdevs In the configuration shown below, the removal of mirror-0 will permanently corrupt the pool. Device removal will preferentially copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'. Which in this case will result in nothing being copied since one vdev in each of those groups in unavailable. However, device removal will complete successfully since all IO errors are ignored. tank DEGRADED 0 0 0 mirror-0 DEGRADED 0 0 0 /var/tmp/vdev1 FAULTED 0 0 0 external fault /var/tmp/vdev2 ONLINE 0 0 0 mirror-1 DEGRADED 0 0 0 /var/tmp/vdev3 ONLINE 0 0 0 /var/tmp/vdev4 FAULTED 0 0 0 external fault This issue is resolved by updating the source child selection logic to exclude unreadable leaf vdevs. Additionally, unwritable destination child vdevs which can never succeed are skipped to prevent generating a large number of write IO errors. Situation 2: individual hard IO errors During removal if an unexpected hard IO error is encountered when either reading or writing the child vdev the entire removal operation is cancelled. While it may be possible to reconstruct the data after removal that cannot be guaranteed. The only strictly safe thing to do is to cancel the removal. As a future improvement we may want to instead suspend the removal process and allow the damaged region to be retried. But that work is left for another time, hard IO errors during the removal process are expected to be exceptionally rare. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6900 Closes #8161	2018-12-04 09:37:37 -08:00
Tom Caputi	c40a1124e1	Fix consistency of ztest_device_removal_active ztest currently uses the boolean flag ztest_device_removal_active to protect some tests that may not run successfully if they occur at the same time as ztest_device_removal(). Unfortunately, in the event that ztest is in the middle of a device removal when it decides to issue a SIGKILL, the device removal will be automatically restarted (without setting the flag) when the pool is re-imported on the next run. This patch corrects this by ensuring that any in-progress removals are completed before running further tests after the re-import. This patch also makes a few small changes to prevent race conditions involving the creation and destruction of spa->spa_vdev_removal, since this field is not protected by any locks. Some checks that may run concurrently with setting / unsetting this field have been updated to check spa->spa_removing_phys.sr_state instead. The most significant change here is that spa_removal_get_stats() no longer accounts for in-flight work done, since that could result in a NULL pointer dereference. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8105	2018-11-28 20:47:09 -08:00
LOLi	c71c8c715b	zfs_dbgmsg() is not safe from every context This commit reverts to using printk() instead of zfs_dbgmsg() to log messages in vdev_disk_error(): this is necessary because the latter can be called from interrupt context where we are not allowed to sleep. Unfortunately zfs_dbgmsg() performs its allocations calling kmalloc() with the KM_SLEEP flag which may result in the following oops: BUG: scheduling while atomic: swapper/4/0/0x10000100 Call Trace: <IRQ> [<0>] dump_stack+0x19/0x1b ... [<0>] spl_kmem_alloc+0xdf/0x140 [spl] <-- kmem_alloc(size, KM_SLEEP) [<0>] __dprintf+0x69/0x150 [zfs] [<0>] ? kmem_cache_free+0x1e2/0x200 [<0>] vdev_disk_error.part.15+0x5f/0x70 [zfs] [<0>] vdev_disk_io_flush_completion+0x48/0x70 [zfs] [<0>] bio_endio+0x67/0xb0 [<0>] blk_update_request+0x90/0x360 ... [<0>] scsi_finish_command+0xdc/0x140 [<0>] scsi_softirq_done+0x132/0x160 [<0>] blk_done_softirq+0x96/0xc0 [<0>] __do_softirq+0xf5/0x280 [<0>] call_softirq+0x1c/0x30 [<0>] do_softirq+0x65/0xa0 [<0>] irq_exit+0x105/0x110 [<0>] do_IRQ+0x56/0xf0 [<0>] common_interrupt+0x162/0x162 <EOI> [<0>] ? cpuidle_enter_state+0x54/0xd0 [<0>] cpuidle_idle_call+0xde/0x230 [<0>] arch_cpu_idle+0xe/0xb0 [<0>] cpu_startup_entry+0x14a/0x1e0 [<0>] start_secondary+0x1f7/0x270 [<0>] start_cpu+0x5/0x14 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8137 Closes #8150	2018-11-28 11:29:57 -08:00
Tom Caputi	cef48f14da	Remove races from scrub / resilver tests Currently, several tests in the ZFS Test Suite that attempt to test scrub and resilver behavior occasionally fail. A big reason for this is that these tests use a combination of zinject and zfs_scan_vdev_limit to attempt to slow these operations enough to verify their test commands. This method works most of the time, but provides no guarantees and leads to flaky behavior. This patch adds a new tunable, zfs_scan_suspend_progress, that ensures that scans make no progress, guaranteeing that tests can be run without racing. This patch also changes zfs_remove_max_bytes_pause to match this new tunable. This provides some consistency between these two similar tunables and ensures that the tunable will not misbehave on 32-bit systems. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8111	2018-11-28 10:12:08 -08:00
LOLi	c8fd652ce7	Fix coverity defects: CID 184285 CID 184285: Read from pointer after free (USE_AFTER_FREE) This patch fixes an use-after-free in vdev_config_generate_stats() moving the kmem_free() call at the end of the function. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8120	2018-11-11 18:09:00 -08:00
loli10K	d48091de81	zed: detect and offline physically removed devices This commit adds a new test case to the ZFS Test Suite to verify ZED can detect when a device is physically removed from a running system: the device will be offlined if a spare is not available in the pool. We implement this by using the existing libudev functionality and without relying solely on the FM kernel module capabilities which have been observed to be unreliable with some kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1537 Closes #7926	2018-11-09 11:17:24 -08:00
Tony Hutter	ad796b8a3b	Add zpool status -s (slow I/Os) and -p (parseable) This patch adds a new slow I/Os (-s) column to zpool status to show the number of VDEV slow I/Os. This is the number of I/Os that didn't complete in zio_slow_io_ms milliseconds. It also adds a new parsable (-p) flag to display exact values. NAME STATE READ WRITE CKSUM SLOW testpool ONLINE 0 0 0 - mirror-0 ONLINE 0 0 0 - loop0 ONLINE 0 0 0 20 loop1 ONLINE 0 0 0 0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7756 Closes #6885	2018-11-08 16:47:24 -08:00
George Melikov	877d925a9e	Update zfs_admin_snapshot value (disabled) It's disabled by default, update code and tests to reflect the documentation. Minor cleanup in delegate_common.kshlib. Reviewed-by: Gregor Kopka <gregor@kopka.net> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #7835 Closes #8045	2018-11-08 16:17:12 -08:00
Tom Caputi	20eb30d08e	Fix divide by zero during indirect split damage This patch simply ensures that vdev_indirect_splits_damage() cannot hit a divide by zero exception if a split has no children with valid data. The normal reconstruction code path in vdev_indirect_reconstruct_io_done() already has this check. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8086	2018-11-07 15:44:56 -08:00
Tom Caputi	fde25c0a87	Fix dirtying vdev config on with RO spa This patch simply corrects an issue where vdev_dtl_reassess() could attempt to dirty the vdev config even when the spa was not elligable for writing. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8085	2018-11-07 15:44:14 -08:00
Tom Caputi	f44ad9297d	Replay logs before starting ztest workers This patch ensures that logs are replayed on all datasets prior to starting ztest workers. This ensures that the call to vdev_offline() a log device in ztest_fault_inject() will not fail due to the log device being required for replay. This patch also fixes a small issue found during testing where spa_keystore_load_wkey() does not check that the dataset specified is an encryption root. This check was present in libzfs, however. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8084	2018-11-07 15:40:24 -08:00
Tom Caputi	ac53e50f79	Fix vdev removal finishing race This patch fixes a race condition where the end of vdev_remove_replace_with_indirect(), which holds svr_lock, would race against spa_vdev_removal_destroy(), which destroys the same lock and is called asynchronously via dsl_sync_task_nowait(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #6900 Closes #8083	2018-11-07 15:38:10 -08:00

1 2 3 4 5 ...

2009 Commits