mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
jdike	4c98586daf	Fix lockdep recursive locking false positive in dbuf_destroy lockdep reports a possible recursive lock in dbuf_destroy. It is true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode while holding it on another dnode. However, it is impossible for these to be the same dnode because, among other things,dbuf_destroy checks MUTEX_HELD before acquiring the mutex. This fix defines a class NESTED_SINGLE == 1 and changes that lock to call mutex_enter_nested with a subclass of NESTED_SINGLE. In order to make the userspace code compile, include/sys/zfs_context.h now defines mutex_enter_nested and NESTED_SINGLE. This is the lockdep report: [ 122.950921] ============================================ [ 122.950921] WARNING: possible recursive locking detected [ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O [ 122.950921] -------------------------------------------- [ 122.950921] dbu_evict/1457 is trying to acquire lock: [ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] but task is already holding lock: [ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] other info that might help us debug this: [ 122.950921] Possible unsafe locking scenario: [ 122.950921] CPU0 [ 122.950921] ---- [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] * DEADLOCK * [ 122.950921] May be due to missing lock nesting notation [ 122.950921] 1 lock held by dbu_evict/1457: [ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] stack backtrace: [ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1 [ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009 [ 122.950921] Call Trace: [ 122.950921] dump_stack+0x91/0xeb [ 122.950921] __lock_acquire+0x2ca7/0x4f10 [ 122.950921] lock_acquire+0x153/0x330 [ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs] [ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs] [ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs] [ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs] [ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs] [ 122.950921] taskq_thread+0x979/0x1480 [spl] [ 122.950921] kthread+0x2e7/0x3e0 [ 122.950921] ret_from_fork+0x27/0x50 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #8984	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	2b9f73e5e6	Use zfsctl_snapshot_hold() wrapper zfs_refcount_() are to be wrapped by zfsctl_snapshot_() in this file. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9039	2019-09-25 11:27:49 -07:00
Brian Behlendorf	446d08fba4	Fix get_special_prop() build failure The cast of the size_t returned by strlcpy() to a uint64_t by the VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE is set. This is due to the additional hardening. Since the token is expected to always fit in strval the VERIFY3U has been removed. If somehow it doesn't, it will still be safely truncated. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8999 Closes #9020	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	73e50a7d5d	Drop redundant POSIX ACL check in zpl_init_acl() ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(), so no need to test again on POSIX ACL access. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9009	2019-09-25 11:27:49 -07:00
Brian Behlendorf	d751b12a9d	Export dnode symbols External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8994 Closes #9027	2019-09-25 11:27:49 -07:00
Tom Caputi	78831d4290	Ensure dsl_destroy_head() decrypts objsets This patch corrects a small issue where the dsl_destroy_head() code that runs when the async_destroy feature is disabled would not properly decrypt the dataset before beginning processing. If the dataset is not able to be decrypted, the optimization code now simply does not run and the dataset is completely destroyed in the DSL sync task. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9021	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	0a223246e1	Disable unused pathname::pn_path* (unneeded in Linux) struct pathname is originally from Solaris VFS, and it has been used in ZoL to merely call VOP from Linux VFS interface without API change, therefore pathname::pn_path* are unused and unneeded. Technically, struct pathname is a wrapper for C string in ZoL. Saves stack a bit on lookup and unlink. (#if0'd members instead of removing since comments refer to them.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9025	2019-09-25 11:27:49 -07:00
Nick Mattis	cf966cb19a	Fixes: #8934 Large kmem_alloc Large allocation over the spl_kmem_alloc_warn value was being performed. Switched to vmem_alloc interface as specified for large allocations. Changed the subsequent frees to match. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: nmattis <nickm970@gmail.com> Closes #8934 Closes #9011	2019-09-25 11:27:49 -07:00
Tom Caputi	1f72a18f59	Remove VERIFY from dsl_dataset_crypt_stats() This patch fixes an issue where dsl_dataset_crypt_stats() would VERIFY that it was able to hold the encryption root. This function should instead silently continue without populating the related field in the nvlist, as is the convention for this code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8976	2019-09-25 11:27:49 -07:00
Paul Zuchowski	14a11bf2f6	Improve "Unable to automount" error message. Having the mountpoint and dataset name both in the message made it confusing to read. Additionally, convert this to a zfs_dbgmsg rather than sending it to the console. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #8959	2019-09-25 11:27:49 -07:00
Brian Behlendorf	7a03d7c73c	Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY is set. Therefore, the EQUIV check must be wrapped accordingly. For the same reason the ASSERT in arc_buf_fill() in unsafe. However, since it's largely redundant it has simply been removed. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8979	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	093bb64461	Don't use d_path() for automount mount point for chroot'd process Chroot'd process fails to automount snapshots due to realpath(3) failure in mount.zfs(8). Construct a mount point path from sb of the ctldir inode and dirent name, instead of from d_path(), so that chroot'd process doesn't get affected by its view of fs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8903 Closes #8966	2019-09-25 11:27:48 -07:00
George Wilson	7d2489cfad	nopwrites on dmu_sync-ed blocks can result in a panic After device removal, performing nopwrites on a dmu_sync-ed block will result in a panic. This panic can show up in two ways: 1. an attempt to issue an IOCTL in vdev_indirect_io_start() 2. a failed comparison of zio->io_bp and zio->io_bp_orig in zio_done() To resolve both of these panics, nopwrites of blocks on indirect vdevs should be ignored and new allocations should be performed on concrete vdevs. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #8957	2019-09-25 11:27:48 -07:00
Alexander Motin	04d4df89f4	Avoid extra taskq_dispatch() calls by DMU DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do. This change adds check for empty sublists to avoid this. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8909	2019-09-25 11:27:48 -07:00
Tom Caputi	bfe5f029cf	Fix error message on promoting encrypted dataset This patch corrects the error message reported when attempting to promote a dataset outside of its encryption root. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8905 Closes #8935	2019-09-25 11:27:48 -07:00
Matthew Ahrens	7d64595c25	dn_struct_rwlock can not be held in dmu_tx_try_assign() The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock while assigning the tx, because this can lead to deadlock. Specifically, if this dnode is already assigned to an earlier txg, this thread may need to wait for that txg to sync (the ERESTART case below). The other thread that has assigned this dnode to an earlier txg prevents this txg from syncing until its tx can complete (calling dmu_tx_commit()), but it may need to acquire the dn_struct_rwlock to do so (e.g. via dmu_buf_hold*()). This commit adds an assertion to dmu_tx_try_assign() to ensure that this deadlock is not inadvertently introduced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8929	2019-09-25 11:27:48 -07:00
Paul Dagnelie	1fd28bd8d4	Add SCSI_PASSTHROUGH to zvols to enable UNMAP support When exporting ZVOLs as SCSI LUNs, by default Windows will not issue them UNMAP commands. This reduces storage efficiency in many cases. We add the SCSI_PASSTHROUGH flag to the zvol's device queue, which lets the SCSI target logic know that it can handle SCSI commands. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8933	2019-09-25 11:27:48 -07:00
Tomohiro Kusumi	ab24c9cd4c	Prevent pointer to an out-of-scope local variable `show_str` could be a pointer to a local variable in stack which is out-of-scope by the time `return (snprintf(buf, buflen, "%s\n", show_str));` is called. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8924 Closes #8940	2019-09-25 11:27:48 -07:00
Matthew Ahrens	3c2a42fd25	dedup=verify doesn't clear the blkptr's dedup flag The logic to handle strong checksum collisions where the data doesn't match is incorrect. It is not clearing the dedup bit of the blkptr, which can cause a panic later in zio_ddt_free() due to the dedup table not matching what is in the blkptr. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-48097 Closes #8936	2019-09-25 11:27:48 -07:00
Igor K	9af524b0ee	Update vdev_ops_t from illumos Align vdev_ops_t from illumos for better compatibility. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #8925	2019-09-25 11:27:48 -07:00
Tom Caputi	b96ceeead2	Allow unencrypted children of encrypted datasets When encryption was first added to ZFS, we made a decision to prevent users from creating unencrypted children of encrypted datasets. The idea was to prevent users from inadvertently leaving some of their data unencrypted. However, since the release of 0.8.0, some legitimate reasons have been brought up for this behavior to be allowed. This patch simply removes this limitation from all code paths that had checks for it and updates the tests accordingly. Reviewed-by: Jason King <jason.king@joyent.com> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8737 Closes #8870	2019-09-25 11:27:48 -07:00
Alexander Motin	ed7b0d357a	Minimize aggsum_compare(&arc_size, arc_c) calls. For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8901	2019-09-25 11:27:47 -07:00
Matthew Ahrens	6083f40387	panic in removal_remap test on 4K devices If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-61342 Closes #8893	2019-09-25 11:27:47 -07:00
Matthew Ahrens	592ee2e6dd	compress metadata in later sync passes Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable compression (including of metadata). Ostensibly this helps the sync passes to converge (i.e. for a sync pass to not need to allocate anything because it is 100% overwrites). However, in practice it increases the average number of sync passes, because when we turn compression off, a lot of block's size will change and thus we have to re-allocate (not overwrite) them. It also increases the number of 128KB allocations (e.g. for indirect blocks and spacemaps) because these will not be compressed. The 128K allocations are especially detrimental to performance on highly fragmented systems, which may have very few free segments of this size, and may need to load new metaslabs to satisfy 128K allocations. We should increase zfs_sync_pass_dont_compress. In practice on a highly fragmented system we see a few 5-pass txg's, a tiny number of 6-pass txg's, and no txg's with more than 6 passes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-63431 Closes #8892	2019-09-25 11:27:47 -07:00
Alexander Motin	cab7d856ea	Move write aggregation memory copy out of vq_lock Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8890	2019-09-25 11:27:47 -07:00
Matthew Ahrens	77e64c6fff	ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread() When running zloop, we occasionally see the following crash: dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0) ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3] The error value 0x1c is ENOSPC. The transaction used by spa_vdev_remove_thread() should not be able to fail due to being out of space. i.e. we should not call dmu_tx_hold_space(). This will allow the removal thread to schedule its work even when the pool is low on space. The "slop space" will provide enough free space to sync out the txg. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-37853 Closes #8889	2019-09-25 11:27:47 -07:00
Tomohiro Kusumi	4f809bddc6	Fix lockdep warning on insmod sysfs_attr_init() is required to make lockdep happy for dynamically allocated sysfs attributes. This fixed #8868 on Fedora 29 running kernel-debug. This requirement was introduced in 2.6.34. See include/linux/sysfs.h for what it actually does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8868 Closes #8884	2019-09-25 11:27:47 -07:00
Matthew Ahrens	516a08ebb4	fat zap should prefetch when iterating When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58347 Closes #8862	2019-09-25 11:27:47 -07:00
Matthew Ahrens	812c36fc71	Target ARC size can get reduced to arc_c_min Sometimes the target ARC size is reduced to arc_c_min, which impacts performance. We've seen this happen as part of the random_reads performance regression test, where the ARC size is reduced before the reads test starts which impacts how long it takes for system to reach good IOPS performance. We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE, and arc_available_memory() is less than arc_c>>arc_shrink_shift. However, arc_available_memory() could easily be low, even when arc_c is low, because we can have tons of unused bufs in the abd kmem cache. This would be especially true just after the DMU requests a bunch of stuff be evicted from the ARC (e.g. due to "zpool export"). To fix this, the ARC should reduce arc_c by the requested amount, not all the way down to arc_size (or arc_c_min), which can be very small. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-59431 Closes #8864	2019-09-25 11:27:47 -07:00
bnjf	fe11968bbf	Fix typo in vdev_raidz_math.c Fix typo in vdev_raidz_math.c Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brad Forschinger <github@bnjf.id.au> Closes #8875 Closes #8880	2019-09-25 11:27:47 -07:00
Paul Dagnelie	6f7bc75825	Allow metaslab to be unloaded even when not freed from On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using. Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from. We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done(). Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8837	2019-09-25 11:27:47 -07:00
Allan Jude	60cbc18136	l2arc_apply_transforms: Fix typo in comment Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #8822	2019-09-25 11:27:46 -07:00
Serapheim Dimitropoulos	b63ed49c29	Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8859	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	fafe72712a	Drop objid argument in zfs_znode_alloc() (sync with OpenZFS) Since zfs_znode_alloc() already takes dmu_buf_t*, taking another uint64_t argument for objid is redundant. inode's ->i_ino does and needs to match znode's ->z_id. zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument since vnode doesn't have vnode# in VFS (hence ->z_id exists). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8841	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	328c95e391	Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8826	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	e5a877c5d0	Update descriptions for vnops These descriptions are not uptodate with the code. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8767	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	4933b0a25b	Drop local definition of MOUNT_BUSY It's accessible via <sys/mntent.h>. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8765	2019-09-25 11:27:46 -07:00
Alexander Motin	72888812b0	Fix comparison signedness in arc_is_overflowing() When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8873	2019-06-11 10:45:23 -07:00
Tom Caputi	581c77e725	Fix incorrect error message for raw receive This patch fixes an incorrect error message that comes up when doing a non-forcing, raw, incremental receive into a dataset that has a newer snapshot than the "from" snapshot. In this case, the current code prints a confusing message about an IVset guid mismatch. This functionality is supported by non-raw receives as an undocumented feature, but was never supported by the raw receive code. If this is desired in the future, we can probably figure out a way to make it work. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8758 Closes #8863	2019-06-11 10:45:17 -07:00
Tom Caputi	8dc8bbde6e	Reinstate raw receive check when truncating This patch re-adds a check that was removed in `369aa50`. The check confirms that a raw receive is not occuring before truncating an object's dn_maxblkid. At the time, it was believed that all cases that would hit this code path would be handled in other places, but that was not the case. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8852 Closes #8857	2019-06-07 12:45:40 -07:00
Tom Caputi	35050ef39e	Fix integer overflow of ZTOI(zp)->i_generation The ZFS on-disk format stores each inode's generation ID as a 64 bit number on disk and in-core. However, the Linux kernel's inode is only a 32 bit number. In most places, the code handles this correctly, but the cast is missing in zfs_rezget(). For many pools, this isn't an issue since the generation ID is computed as the current txg when the inode is created and many pools don't have more than 2^32 txgs. For the pools that have more txgs, this issue causes any inode with a high enough generation number to report IO errors after a call to "zfs rollback" while holding the file or directory open. This patch simply adds the missing cast. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8858	2019-06-07 12:45:40 -07:00
DeHackEd	58b2de6420	Wait in 'S' state when send/recv pipe is blocking Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8733 Closes #8752	2019-06-07 12:45:40 -07:00
TulsiJain	11ad06d1d8	Make zfs_async_block_max_blocks handle zero correctly Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: TulsiJain <tulsi.jain@delphix.com> Closes #8829 Closes #8289	2019-06-07 12:45:40 -07:00
Brian Behlendorf	4f8eef29e0	Revert "Report holes when there are only metadata changes" This reverts commit `ec4f9b8f30` which introduced a narrow race which can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by revering this change to restore the previous behavior which depends solely on checking the dirty list. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8816 Closes #8834	2019-06-07 12:45:40 -07:00
Brian Behlendorf	a1eaf0dde0	Exclude log device ashift from normal class When opening a log device during import its allocation bias will not yet have been set by vdev_load(). This results in the log device's ashift being incorrectly applied to the maximum ashift of the vdevs in the normal class. Which in turn prevents the removal of any top-level devices due to the ashift check in the spa_vdev_remove_top_check() function. This issue is resolved by including vdev_islog in the check since it will be set correctly during vdev_open(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8735	2019-06-07 12:45:40 -07:00
madz	580256045b	Fix integer overflow in get_next_chunk() dn->dn_datablksz type is uint32_t and need to be casted to uint64_t to avoid an overflow when the record size is greater than 4 MiB. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr> Closes #8778 Closes #8797	2019-06-07 12:45:40 -07:00
loli10K	aaf3b30dcf	Double-free of encryption wrapping key due to invalid pool properties This commits fixes a double-free in zfs_ioc_pool_create() triggered by specifying an unsupported combination of properties when creating a pool with encryption enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8791	2019-06-07 12:45:40 -07:00
Tom Caputi	69ae34076f	Fix embedded bp accounting in count_block() Currently, count_block() does not correctly account for the possibility that the bp that is passed to it could be embedded. These blocks shouldn't be counted since the work of scanning these blocks in already handled when the containing block is scanned. This patch simply resolves this issue by returning early in this case. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8800 Closes #8766	2019-06-07 12:39:13 -07:00
Tomohiro Kusumi	2fb37bcadd	Linux 5.2 compat: Directly call wait_on_page_bit() wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd. Directly call wait_on_page_bit() without using wait_on_page_writeback() interface, given zfs_putpage() is the only caller for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8794	2019-06-07 12:39:13 -07:00
loli10K	f91e7e6284	Device removal panics on 32-bit systems The issue is caused by an incorrect usage of the sizeof() operator in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on 32-bit systems pointers are 4 bytes long which is not supported by zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system will cause the following failure: VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22) PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete() Showing stack for process 1315 CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76 Call Trace: [<>] dump_stack+0x58/0x7c [<>] spl_dumpstack+0x23/0x27 [spl] [<>] spl_panic.cold.0+0x5/0x41 [spl] [<>] ? dbuf_rele+0x3e/0x90 [zfs] [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs] [<>] ? zap_lookup+0x57/0x70 [zfs] [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs] [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs] [<>] ? txg_verify+0x1d/0x160 [zfs] [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs] [<>] vdev_sync+0xbf/0x550 [zfs] [<>] ? mutex_lock+0x10/0x30 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs] [<>] ? zap_contains+0x4d/0x70 [zfs] [<>] spa_sync+0x9f1/0x1b10 [zfs] ... [<>] ? kthread_stop+0x110/0x110 This commit simply corrects the "integer_size" parameter used to lookup the vdev's ZAP object. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8790	2019-06-07 12:39:13 -07:00

1 2 3 4 5 ...

2050 Commits