mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Paul Zuchowski	4d658bda32	zio_decompress_data always ASSERTs successful decompression This interferes with zdb_read_block trying all the decompression algorithms when the 'd' flag is specified, as some are expected to fail. Also control the output when guessing algorithms, try the more common compression types first, allow specifying lsize/psize, and fix an uninitialized variable. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9612 Closes #9630	2020-01-22 13:49:05 -08:00
Matthew Macy	d2233a08fa	Exclude data from cores unconditionally and metadata conditionally This change allows us to align the code dump logic across platforms. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9691	2020-01-22 13:49:05 -08:00
Brian Behlendorf	2525b71c68	ZTS: Fix zpool_reopen_001_pos Update the vdev_disk_open() retry logic to use a specified number of milliseconds to be more robust. Additionally, on failure log both the time waited and requested timeout to the internal log. The default maximum allowed open retry time has been increased from 500ms to 1000ms. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9680 Conflicts:	2020-01-22 13:49:05 -08:00
Alexander Motin	388ef045b2	Fix use-after-free in case of L2ARC prefetch failure In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO to read data from the original storage device. Unfortunately pointer to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if some other read try to bump the ZIO priority, it will crash. The problem is reproducible by corrupting L2ARC content and reading some data with prefetch if l2arc_noprefetch tunable is changed to 0. With the default setting the issue is probably not reproducible now. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #9648	2020-01-22 13:49:05 -08:00
Brian Behlendorf	36fe63042c	Remove zfs_vdev_elevator module option As described in commit `f81d5ef6` the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9417 Closes #9609	2020-01-22 13:49:04 -08:00
Mauricio Faria de Oliveira	bc21c56c2d	Check for unlinked znodes after igrab() The changes in commit `41e1aa2a` / PR #9583 introduced a regression on tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started to fail with errno ENODATA: openat(AT_FDCWD, "/test", O_RDWR\|O_TMPFILE, 0666) = 3 <...> fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA The originally proposed change on PR #9583 is not susceptible to it, so just move the code/if-checks around back in that way, to fix it. Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes #9602	2020-01-22 13:49:04 -08:00
Heitor Alves de Siqueira	20e124dd71	Break out of zfs_zget early if unlinked znode If zp->z_unlinked is set, we're working with a znode that has been marked for deletion. If that's the case, we can skip the "goto again" loop and return ENOENT, as the znode should not be discovered. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Closes #9583	2020-01-22 13:49:04 -08:00
loli10K	880a37aa35	Prevent NULL pointer dereference in blkg_tryget() on EL8 kernels blkg_tryget() as shipped in EL8 kernels does not seem to handle NULL @blkg as input; this is different from its mainline counterpart where NULL is accepted. To prevent dereferencing a NULL pointer when dealing with block devices which do not set a root_blkg on the request queue perform the NULL check in vdev_bio_associate_blkg(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9546 Closes #9577	2020-01-22 13:49:04 -08:00
Alexander Motin	edaec84225	Improve logging of 128KB writes Before my ZIL space optimization few years ago 128KB writes were logged as two 64KB+ records in two 128KB log blocks. After that change it became ~127KB+/1KB+ in two 128KB log blocks to free space in the second block for another record. Unfortunately in case of 128KB only writes, when space in the second block remained unused, that change increased write latency by unbalancing checksum computation and write times between parallel threads. It also didn't help with SLOG space efficiency in that case. This change introduces new 68KB log block size, used for both writes below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB block to not increase processing overhead. Writes above 131KB are still using full 128KB blocks, since possible saving there is small. Mixed loads will likely also fall back to previous 128KB, since code uses maximum of the last 16 requested block sizes. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #9409	2020-01-22 13:49:04 -08:00
Matthew Macy	ca0f9b7473	Include prototypes for vdev_initialize Address two prototype related warnings emitted by clang. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9535	2020-01-22 13:49:03 -08:00
Tom Caputi	7e1b772edd	Fix 'zfs change-key' with unencrypted child Currently, when you call 'zfs change-key' on an encrypted dataset that has an unencrypted child, the code will trigger a VERIFY. This VERIFY is leftover from before we allowed unencrypted datasets to exist underneath encrypted ones. This patch fixes the issue by simply replacing the VERIFY with an early return when recursing through datasets. Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9524	2020-01-22 13:49:03 -08:00
loli10K	f1ba5478a3	Fix pool creation with feature@allocation_classes disabled When "feature@allocation_classes" is not enabled on the pool no vdev with "special" or "dedup" allocation type should be allowed to exist in the vdev tree. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9427 Closes #9429	2020-01-22 13:49:02 -08:00
Brian Behlendorf	5a1bf9e8b1	Fix automount for root filesystems Commit `093bb64` resolved an automount failures for chroot'd processes but inadvertently broke automounting for root filesystems where the vfs_mntpoint is NULL. Resolve the issue by checking for NULL in order to generate the correct path. Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9381 Closes #9384	2020-01-22 13:49:02 -08:00
Matthew Macy	b43893de86	Rename rangelock_ functions to zfs_rangelock_ A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9402	2020-01-22 13:49:02 -08:00
Brian Behlendorf	62c034f6d4	Linux 5.0 compat: SIMD compatibility Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This commit squashes the following commits from master in to a single commit which can be applied to 0.8.2. `10fa2545` - Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore `b88ca2ac` - Enable SIMD for encryption `095b5412` - Fix CONFIG_X86_DEBUG_FPU build failure `e5db3134` - Linux 5.0 compat: SIMD compatibility Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes"	2020-01-22 13:49:01 -08:00
Brian Behlendorf	055238d2eb	Add warning for zfs_vdev_elevator option removal Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. While well intentioned this introduced a bug which could cause a system hang, that issue was subsequently fixed by commit `2a0d4188`. In order to avoid future bugs in this area, and to simplify the code, this functionality is being deprecated. A console warning has been added to notify any existing consumers and the documentation updated accordingly. This option will remain for the lifetime of the 0.8.x series for compatibility but if planned to be phased out of master. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9317	2020-01-22 13:49:01 -08:00
loli10K	ec5d76e853	diff_cb() does not handle large dnodes Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try to access its interior slots when dnodesize > sizeof(dnode_phys_t). This is normally not an issue because the interior slots are zero-filled, which report_dnode() handles calling report_free_dnode_range(). However this is not the case for encrypted large dnodes or filesystem using many SA based xattrs where the extra data past the legacy dnode size boundary is interpreted as a dnode_phys_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7678 Closes #8931 Closes #9343	2020-01-22 13:49:01 -08:00
loli10K	444df1051c	Device removal of indirect vdev panics the kernel This commit fixes a NULL pointer dereference triggered in spa_vdev_remove_top_check() by trying to "zpool remove" an indirect vdev. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9327	2020-01-22 13:49:01 -08:00
Tom Caputi	5986c5c687	Fix clone handling with encryption roots Currently, spa_keystore_change_key_sync_impl() does not recurse into clones when updating encryption roots for either a call to 'zfs promote' or 'zfs change-key'. This can cause children of these clones to end up in a state where they point to the wrong dataset as the encryption root. It can also trigger ASSERTs in some cases where the code checks reference counts on wrapping keys. This patch fixes this issue by ensuring that this function properly recurses into clones during processing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9267 Closes #9294	2020-01-22 13:49:00 -08:00
Tom Caputi	8747ee4513	Fix stalled txg with repeated noop scans Currently, the DSL scan code figures out when it should suspend processing and allow a txg to continue by calling the function dsl_scan_check_suspend(). Unfortunately, this function only allows the scan to suspend at a level 0 block. In the event that the system is scanning a bunch of empty snapshots or a resilver is running with a high enough scn_cur_min_txg, the scan will stop processing each dataset at the root level, deciding it has nothing left to do. This means that the check_suspend function is never called and the txg remains stuck until a dataset is found that has data to scan. This patch fixes the problem by allowing scans to suspend at the root level of the objset. For backwards compatibility, we use the bookmark <objsetid, 0, 0, 0> when we suspend here so that older versions of the code will work as intended. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9300	2020-01-22 13:49:00 -08:00
Igor K	5cb46afcf1	Fix panic on DilOS with kstat per dataset statistics Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value is ignored in the Linux kstat code but resolves the issue for other platforms. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #9254 Closes #9151	2020-01-22 13:49:00 -08:00
George Wilson	7e93917309	maxinflight can overflow in spa_load_verify_cb() When running on larger memory systems, we can overflow the value of maxinflight. This can result in maxinflight having a value of 0 causing the system to hang. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #9272	2020-01-22 13:49:00 -08:00
Andrea Gelmini	5097eb6ac9	Fix typos in module/zfs/ Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #9240	2020-01-22 13:48:59 -08:00
Paul Dagnelie	ebdb770554	Prevent metaslab_sync panic due to spa_final_dirty_txg If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being exported, we can enter a situation that causes a kernel panic. Any metaslabs that are loaded during the final dirty txg and haven't already been condensed will cause metaslab_sync to proceed after the final dirty txg so that the condense can be performed, which there are assertions to prevent. Because of the nature of this issue, there are a number of ways we can enter this state. Rather than try to prevent each of them one by one, potentially missing some edge cases, we instead cut it off at the point of intersection; by preventing metaslab_sync from proceeding if it would only do so to perform a condense and we're past the final dirty txg, we preserve the utility of the existing asserts while preventing this particular issue. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9185 Closes #9186 Closes #9231 Closes #9253	2020-01-22 13:48:58 -08:00
Tony Nguyen	16f42e1b6d	Use smaller default slack/delta value for schedule_hrtimeout_range() For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta for calls to schedule_hrtimeout_range(). This 100us slack can be costly for small writes. This change improves small write performance by passing resolution `res` parameter to schedule_hrtimeout_range() to be used as delta/slack. A new tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old behavior when desired. Performance observations on 8K recordsize filesystem: - 8K random writes at 1-64 threads, up to 60% improvement for one thread and smaller gains as thread count increases. At >64 threads, 2-5% decrease in performance was observed. - 8K sequential writes, similar 60% improvement for one thread and leveling out around 64 threads. At >64 threads, 5-10% decrease in performance was observed. - 128K sequential write sees 1-5 for the 128K. No observed regression at high thread count. Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #9217	2020-01-22 13:48:58 -08:00
jdike	db58aa717b	Fix lockdep circular locking false positive involving sa_lock There are two different deadlock scenarios, but they share a common link, which is thread 1 holding sa_lock and trying to get zap->zap_rwlock: zap_lockdir_impl+0x858/0x16c0 [zfs] zap_lockdir+0xd2/0x100 [zfs] zap_lookup_norm+0x7f/0x100 [zfs] zap_lookup+0x12/0x20 [zfs] sa_setup+0x902/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] and thread 2 trying to get sa_lock, either in sa_setup: sa_setup+0x742/0x1380 [zfs] zfsvfs_init+0x3d6/0xb20 [zfs] zfsvfs_create+0x5dd/0x900 [zfs] zfs_domount+0xa3/0xe20 [zfs] or in sa_build_index: sa_build_index+0x13d/0x790 [zfs] sa_handle_get_from_db+0x368/0x500 [zfs] zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs] zfs_znode_alloc+0x3da/0x1a40 [zfs] zfs_zget+0x39a/0x6e0 [zfs] zfs_root+0x101/0x160 [zfs] zfs_domount+0x91f/0xea0 [zfs] From there, there are different locking paths back to something holding zap->zap_rwlock. The deadlock scenarios involve multiple different ZFS filesystems being mounted. sa_lock is common to these scenarios, and the sa struct involved is private to a mount. Therefore, these must be referring to different sa_lock instances and these deadlocks can't occur in practice. The fix, from Brian Behlendorf, is to remove sa_lock from lockdep coverage by initializing it with MUTEX_NOLOCKDEP. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #9110	2020-01-22 13:48:57 -08:00
Serapheim Dimitropoulos	dfa4d3d986	dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() Even though the bug's writeup (Github issue #9136) is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information on filled issue with the code. From the fact that `dp_dirty_total` was 0 (which is less than `zfs_dirty_data_max`) we know that there was one thread that set it to 0 and then signaled one of the waiters of `dp_spaceavail_cv` [see `dsl_pool_dirty_delta()` which is also the only place that `dp_dirty_total` is changed]. Thus, the only logical explaination then for the bug being hit is that the waiter that just got awaken didn't go through `dsl_pool_dirty_data()`. Given that this function is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()` I can only think of two possible ways of the above scenario happening: [1] The waiter didn't call into any of the two functions - which I find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin with?). [2] The waiter did call in one of the above function but it passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (e.g both `dsl_pool_dirty_space()` and `dsl_pool_undirty_space()` return immediately when space is 0). In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever `dp_dirty_total` hits 0. That said and given that we've never hit this before, it would make sense to think more on why the above situation occured. Attempting to mimic what Prakash was doing in the issue filed, I created a dataset with `sync=always` and started doing contiguous writes in a file within that dataset. I observed with DTrace that even though we update the pool's dirty data accounting when we would dirty stuff, the accounting wouldn't be decremented incrementally as we were done with the ZIOs of those writes (the reason being that `dbuf_write_physdone()` isn't be called as we go through the override code paths, and thus `dsl_pool_undirty_space()` is never called). As a result we'd have to wait until we get to `dsl_pool_sync()` where we zero out all dirty data accounting for the pool and the current TXG's metadata. In addition, as Matt noted and I later verified, the same issue would arise when using dedup. In both cases (sync & dedup) we shouldn't have to wait until `dsl_pool_sync()` zeros out the accounting data. According to the comment in that part of the code, the reasons why we do the zeroing, have nothing to do with what we observe: ```` /* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now. */ dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); ```` Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable. Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through `dbuf_new_size()` we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less that we dirtied from the pool's accounting point of view. For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not. For changes in the MOS, we first ensure to zero out the pool's dirty data accounting in `dsl_pool_sync()` after we synced the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting wheneve we change MOS data. Another fix is that we now update the accounting explicitly for counting errors in `dbuf_write_done()`. Finally, `dbuf_new_size()` updates the accounted space of the appropriate dirty record correctly now. The problem is that we still don't know how the bug came up in the issue filled. That said the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> External-issue: DLPX-47285 Closes #9137	2020-01-22 13:48:57 -08:00
Tony Nguyen	b78d32cc25	Improve write performance by using dmu_read_by_dnode() In zfs_log_write(), we can use dmu_read_by_dnode() rather than dmu_read() thus avoiding unnecessary dnode_hold() calls. We get a 2-5% performance gain for large sequential_writes tests, >=128K writes to files with recordsize=8K. Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on VMware ESX. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #9156	2020-01-22 13:48:57 -08:00
Serapheim Dimitropoulos	dd6d0bdbb3	Assert that a dnode's bonuslen never exceeds its recorded size This patch introduces an assertion that can catch pitfalls in development where there is a mismatch between the size of reads and writes between a *_phys structure and its respective in-core structure when bonus buffers are used. This debugging-aid should be complementary to the verification done by ztest in ztest_verify_dnode_bt(). A side to this patch is that we now clear out any extra bytes past a bonus buffer's new size when the buffer is shrinking. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8348	2020-01-22 13:48:57 -08:00
Paul Zuchowski	9d4ca81b6f	Make txg_wait_synced conditional in zfsvfs_teardown The call to txg_wait_synced in zfsvfs_teardown should be made conditional on the objset having dirty data. This can prevent unnecessary txg_wait_synced during some unmount operations. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #9115	2020-01-22 13:48:57 -08:00
Paul Dagnelie	93fd9101c9	Prevent race in blkptr_verify against device removal When we check the vdev of the blkptr in zfs_blkptr_verify, we can run into a race condition where that vdev is temporarily unavailable. This happens when a device removal operation and the old vdev_t has been removed from the array, but the new indirect vdev has not yet been inserted. We hold the spa_config_lock while doing our sensitive verification. To ensure that we don't deadlock, we only grab the lock if we don't have config_writer held. In addition, I had to const the tags of the refcounts and the spa_config_lock arguments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9112	2020-01-22 13:48:57 -08:00
George Wilson	628fd31d26	spa_load_verify() may consume too much memory When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue too many I/Os that it consumes all the memory of the system resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: George Wilson george.wilson@delphix.com External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146	2020-01-22 13:48:57 -08:00
Tomohiro Kusumi	d38e4ee142	Change boolean-like uint8_t fields in znode_t to boolean_t Given znode_t is an in-core structure, it's more readable to have them as boolean. Also co-locate existing boolean fields with them for space efficiency (expecting 8 booleans to be packed/aligned). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9092 Conflicts: include/sys/zfs_znode.h module/zfs/zfs_znode.c	2020-01-22 13:48:57 -08:00
DeHackEd	376ca4649b	Don't wakeup unnecessarily in 'zpool events -f' ZED can prevent CPU's from properly sleeping. Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #9091	2020-01-22 13:48:57 -08:00
jdike	77d59a6d63	lockdep false positive - move txg_kick() outside of ->dp_lock This fixes a lockdep warning by breaking a link between ->tx_sync_lock and ->dp_lock. The deadlock envisioned by lockdep is this: thread 1 holds db->db_mtx and tries to get dp->dp_lock: dsl_pool_dirty_space+0x70/0x2d0 [zfs] dbuf_dirty+0x778/0x31d0 [zfs] thread 2 holds bpo->bpo_lock and tries to get db->db_mtx: dmu_buf_will_dirty_impl dmu_buf_will_dirty+0x6b/0x6c0 [zfs] bpobj_iterate_impl+0xbe6/0x1410 [zfs] thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock: bpobj_space+0x63/0x470 [zfs] dsl_scan_active+0x340/0x3d0 [zfs] txg_sync_thread+0x3f2/0x1370 [zfs] thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock txg_kick+0x61/0x420 [zfs] dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs] This patch is orginally from Brian Behlendorf and slightly simplified by me. It breaks this cycle in thread 4 by moving the call from dsl_pool_need_dirty_delay to txg_kick outside the section controlled by dp->dp_lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #9094	2020-01-22 13:48:57 -08:00
Paul Dagnelie	66c8b2f65a	Don't activate metaslabs with weight 0 We return ENOSPC in metaslab_activate if the metaslab has weight 0, to avoid activating a metaslab with no space available. For sanity checking, we also assert that there is no free space in the range tree in that case. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8968	2020-01-22 13:48:57 -08:00
Paul Dagnelie	350646563f	Concurrent small allocation defeats large allocation With the new parallel allocators scheme, there is a possibility for a problem where two threads, allocating from the same allocator at the same time, conflict with each other. There are two primary cases to worry about. First, another thread working on another allocator activates the same metaslab that the first thread was trying to activate. This results in the first thread needing to go back and reselect a new metaslab, even though it may have waited a long time for this metaslab to load. Second, another thread working on the same allocator may have activated a different metaslab while the first thread was waiting for its metaslab to load. Both of these cases can cause the first thread to be significantly delayed in issuing its IOs. The second case can also cause metaslab load/unload churn; because the metaslab is loaded but not fully activated, we never set the selected_txg, which results in the metaslab being immediately unloaded again. This process can repeat many times, wasting disk and cpu resources. This is more likely to happen when the IO of the first thread is a larger one (like a ZIL write) and the other thread is doing a smaller write, because it is more likely to find an acceptable metaslab quickly. There are two primary changes. The first is to always proceed with the allocation when returning from metaslab_activate if we were preempted in either of the ways described in the previous section. The second change is to set the selected_txg before we do the call to activate so that even if the metaslab is not used for an allocation, we won't immediately attempt to unload it. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-61314 Closes #8843	2020-01-22 13:48:56 -08:00
loli10K	d47ee5ad1c	Fix bp_embedded_type enum definition With the addition of BP_EMBEDDED_TYPE_REDACTED in `30af21b0` a couple of codepaths make wrong assumptions and could potentially result in errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8951 Conflicts: include/sys/spa.h	2020-01-22 13:48:56 -08:00
Don Brady	e625030c11	OpenZFS 9425 - channel programs can be interrupted Problem Statement ================= ZFS Channel program scripts currently require a timeout, so that hung or long-running scripts return a timeout error instead of causing ZFS to get wedged. This limit can currently be set up to 100 million Lua instructions. Even with a limit in place, it would be desirable to have a sys admin (support engineer) be able to cancel a script that is taking a long time. Proposed Solution ================= Make it possible to abort a channel program by sending an interrupt signal.In the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to catch the signal. Once a signal is encountered, the dsl_sync_task function can install a Lua hook that will get called before the Lua interpreter executes a new line of code. The dsl_sync_task can resume with a standard txg_wait_sync call and wait for the txg to complete. Meanwhile, the hook will abort the script and indicate that the channel program was canceled. The kernel returns a EINTR to indicate that the channel program run was canceled. Porting notes: Added missing return value from cv_wait_sig() Authored by: Don Brady <don.brady@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/9425 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926 Closes #8904	2020-01-22 13:48:56 -08:00
Matthew Ahrens	cbb9154958	looping in metaslab_block_picker impacts performance on fragmented pools On fragmented pools with high-performance storage, the looping in metaslab_block_picker() can become the performance-limiting bottleneck. When looking for a larger block (e.g. a 128K block for the ZIL), we may search through many free segments (up to hundreds of thousands) to find one that is large enough to satisfy the allocation. This can take a long time (up to dozens of ms), and is done while holding the ms_lock, which other threads may spin waiting for. When this performance problem is encountered, profiling will show high CPU time in metaslab_block_picker, as well as in mutex_enter from various callers. The problem is very evident on a test system with a sync write workload with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage, 84% full and 88% fragmented. It has also been observed on production systems with 90TB of storage, 76% full and 87% fragmented. The fix is to change metaslab_df_alloc() to search only up to 16MB from the previous allocation (of this alignment). After that, we will pick a segment that is of the exact size requested (or larger). This reduces the number of iterations to a few hundred on fragmented pools (a ~100x improvement). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-62324 Closes #8877	2020-01-22 13:48:56 -08:00
Matthew Ahrens	8805abb8fc	single-chunk scatter ABDs can be treated as linear Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by `87c25d567`. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8580	2020-01-22 13:48:56 -08:00
Matthew Ahrens	bee5738f77	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2020-01-22 13:48:56 -08:00
Kody A Kantor	c37fa0d5a8	Disabled resilver_defer feature leads to looping resilvers When a disk is replaced with another on a pool with the resilver_defer feature present, but not enabled the resilver activity restarts during each spa_sync. This patch checks to make sure that the resilver_defer feature is first enabled before requesting a deferred resilver. This was originally fixed in illumos-joyent as OS-7982. Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Kody A Kantor <kody@kkantor.com> External-issue: illumos-joyent OS-7982 Closes #9299 Closes #9338	2019-09-25 11:27:51 -07:00
Andriy Gapon	12a78fbb4f	Fix dsl_scan_ds_clone_swapped logic The was incorrect with respect to swapping dataset IDs both in the on-disk ZAP object and the in-memory queue. In both cases, if ds1 was already present, then it would be first replaced with ds2 and then ds would be replaced back with ds1. Also, both cases did not properly handle a situation where both ds1 and ds2 are already queued. A duplicate insertion would be attempted and its failure would result in a panic. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #9140 Closes #9163	2019-09-25 11:27:51 -07:00
loli10K	63d8f57fe7	Scrubbing root pools may deadlock on kernels without elevator_change() (#9321 ) Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. Unfortunately changing the evelator via usermodehelper requires reading some userland binaries, most notably modprobe(8) or sh(1), from a zfs dataset on systems with root-on-zfs. This can deadlock the system if used during the following call path because it may need, if the data is not already cached in the ARC, reading directly from disk while holding the spa config lock as a writer: zfs_ioc_pool_scan() -> spa_scan() -> spa_scan() -> vdev_reopen() -> vdev_elevator_switch() -> call_usermodehelper() While the usermodehelper waits sh(1), modprobe(8) is blocked in the ZIO pipeline trying to read from disk: INFO: task modprobe:2650 blocked for more than 10 seconds. Tainted: P OE 5.2.14 modprobe D 0 2650 206 0x00000000 Call Trace: ? __schedule+0x244/0x5f0 schedule+0x2f/0xa0 cv_wait_common+0x156/0x290 [spl] ? do_wait_intr_irq+0xb0/0xb0 spa_config_enter+0x13b/0x1e0 [zfs] zio_vdev_io_start+0x51d/0x590 [zfs] ? tsd_get_by_thread+0x3b/0x80 [spl] zio_nowait+0x142/0x2f0 [zfs] arc_read+0xb2d/0x19d0 [zfs] ... zpl_iter_read+0xfa/0x170 [zfs] new_sync_read+0x124/0x1b0 vfs_read+0x91/0x140 ksys_read+0x59/0xd0 do_syscall_64+0x4f/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This commit changes how we use the usermodehelper functionality from synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents scrubs, and other vdev_elevator_switch() consumers, from triggering the aforementioned issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Issue #8664 Closes #9321	2019-09-25 11:27:51 -07:00
Chengfei ZHu	9fa8b5b55b	QAT related bug fixes 1. Fix issue: Kernel BUG with QAT during decompression #9276. Now it is uninterruptible for a specific given QAT request, but Ctrl-C interrupt still works in user-space process. 2. Copy the digest result to the buffer only when doing encryption, and vise-versa for decryption. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com> Closes #9276 Closes #9303	2019-09-25 11:27:51 -07:00
Brian Behlendorf	97d4986214	Fix /etc/hostid on root pool deadlock Accidentally introduced by `dc04a8c` which now takes the SCL_VDEV lock as a reader in zfs_blkptr_verify(). A deadlock can occur if the /etc/hostid file resides on a dataset in the same pool. This is because reading the /etc/hostid file may occur while the caller is holding the SCL_VDEV lock as a writer. For example, to perform a `zpool attach` as shown in the abbreviated stack below. To resolve the issue we cache the system's hostid when initializing the spa_t, or when modifying the multihost property. The cached value is then relied upon for subsequent accesses. Call Trace: spa_config_enter+0x1e8/0x350 [zfs] zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock zio_read+0x6c/0x140 [zfs] ... vfs_read+0xfc/0x1e0 kernel_read+0x50/0x90 ... spa_get_hostid+0x1c/0x38 [zfs] spa_config_generate+0x1a0/0x610 [zfs] vdev_label_init+0xa0/0xc80 [zfs] vdev_create+0x98/0xe0 [zfs] spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9256 Closes #9285	2019-09-25 11:27:51 -07:00
Andriy Gapon	beb21db3c6	Always refuse receving non-resume stream when resume state exists This fixes a hole in the situation where the resume state is left from receiving a new dataset and, so, the state is set on the dataset itself (as opposed to %recv child). Additionally, distinguish incremental and resume streams in error messages. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #9252	2019-09-25 11:27:51 -07:00
loli10K	13e5e396a3	Fix Intel QAT / ZFS compatibility on v4.7.1+ kernels This change use the compat code introduced in `9cc1844a`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9268 Closes #9269	2019-09-25 11:27:51 -07:00
Chunwei Chen	c7a4255f12	Fix zil replay panic when TX_REMOVE followed by TX_CREATE If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work fine before, in current version it does not guarantee the object removal is completed. We fix this by checking if DNODE_MUST_BE_FREE returns successful instead. Also add test and remove dead code in dnode_hold_impl. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7151 Closes #8910 Closes #9123 Closes #9145	2019-09-25 11:27:51 -07:00

1 2 3 4 5 ...

2111 Commits