mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Mateusz Guzik	f7a07d76ee	Retire z_nr_znodes Added in `ab26409db7` ("Linux 3.1 compat, super_block->s_shrink"), with the only consumer which needed the count getting retired in `066e825221` ("Linux compat: Minimum kernel version 3.10"). The counter gets in the way of not maintaining the list to begin with. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #15274	2023-09-19 08:52:06 -07:00
наб	0ce7a068e9	check-zstd-symbols: also ignore __pfx_ symbols Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b341b20d648bb7e9a3307c33163e7399f0913e66 Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15282 Closes #15284	2023-09-19 08:52:06 -07:00
George Amanakis	11943656f9	Update the MOS directory on spa_upgrade_errlog() spa_upgrade_errlog() does not update the MOS directory when the head_errlog feature is enabled. In this case if spa_errlog_sync() is not called, the MOS dir references the old errlog_last and errlog_sync objects. Thus when doing a scrub a panic will occur: Call Trace: dump_stack+0x6d/0x8b panic+0x101/0x2e3 spl_panic+0xcf/0x102 [spl] delete_errlog+0x124/0x130 [zfs] spa_errlog_sync+0x256/0x260 [zfs] spa_sync_iterate_to_convergence+0xe5/0x250 [zfs] spa_sync+0x2f7/0x670 [zfs] txg_sync_thread+0x22d/0x2d0 [zfs] thread_generic_wrapper+0x83/0xa0 [spl] kthread+0x104/0x140 ret_from_fork+0x1f/0x40 Fix this by updating the related MOS directory objects in spa_upgrade_errlog(). Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #15279 Closes #15277	2023-09-19 08:51:00 -07:00
Andrea Righi	cacc599aa2	Linux 6.5 compat: spl: properly unregister sysctl entries When register_sysctl_table() is unavailable we fail to properly unregister sysctl entries under "kernel/spl". This leads to errors like the following when spl is unloaded/reloaded, making it impossible to properly reload the spl module: [ 746.995704] sysctl duplicate entry: /kernel/spl/kmem/slab_kvmem_total Fix by cleaning up all the sub-entries inside "kernel/spl" when the spl module is unloaded. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Closes #15239	2023-09-19 08:50:01 -07:00
Andrea Righi	c7ee59a160	Linux 6.5 compat: safe cleanup in spl_proc_fini() If we fail to create a proc entry in spl_proc_init() we may end up calling unregister_sysctl_table() twice: one in the failure path of spl_proc_init() and another time during spl_proc_fini(). Avoid the double call to unregister_sysctl_table() and while at it refactor the code a bit to reduce code duplication. This was accidentally introduced when the spl code was updated for Linux 6.5 compatibility. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Closes #15234 Closes #15235	2023-09-19 08:50:01 -07:00
Coleman Kane	58a707375f	Linux 6.5 compat: Use copy_splice_read instead of filemap_splice_read Using the filemap_splice_read function for the splice_read handler was leading to occasional data corruption under certain circumstances. Favor using copy_splice_read instead, which does not demonstrate the same erroneous behavior under the tested failure cases. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15164	2023-09-19 08:50:01 -07:00
Coleman Kane	5a22de144a	Linux 6.5 compat: replace generic_file_splice_read with filemap_splice_read The generic_file_splice_read function was removed in Linux 6.5 in favor of filemap_splice_read. Add an autoconf test for filemap_splice_read and use it if it is found as the handler for .splice_read in the file_operations struct. Additionally, ITER_PIPE was removed in 6.5. This change removes the ITER_* macros that OpenZFS doesn't use from being tested in config/kernel-vfs-iov_iter.m4. The removal of ITER_PIPE was causing the test to fail, which also affected the code responsible for setting the .splice_read handler, above. That behavior caused run-time panics on Linux 6.5. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15155	2023-09-19 08:50:01 -07:00
Coleman Kane	31a4673c05	Linux 6.5 compat: register_sysctl_table removed Additionally, the .child element of ctl_table has been removed in 6.5. This change adds a new test for the pre-6.5 register_sysctl_table() function, and uses the old code in that case. If it isn't found, then the parentage entries in the tables are removed, and the register_sysctl call is provided the paths of "kernel/spl", "kernel/spl/kmem", and "kernel/spl/kstat" directly, to populate each subdirectory over three calls, as is the new API. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15138	2023-09-19 08:50:01 -07:00
Brian Atkinson	3a68f3c50f	Revert "Linux 6.5 compat: register_sysctl_table removed" This reverts commit `b35374fd64` as there are error messages when loading the SPL module. The errors seemed to be tied to a duplicate entry. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #15134	2023-09-19 08:50:01 -07:00
Coleman Kane	8be6308e85	Linux 4.20 compat: wrapper function for iov_iter type access An iov_iter_type() function to access the "type" member of the struct iov_iter was added at one point. Move the conditional logic to decide which method to use for accessing it into a macro and simplify the zpl_uio_init code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-09-19 08:50:01 -07:00
Coleman Kane	0bf2c5365e	Linux 6.4 compat: iter_iov() function now used to get old iov member The iov_iter->iov member is now iov_iter->__iov and must be accessed via the accessor function iter_iov(). Create a wrapper that is conditionally compiled to use the access method appropriate for the target kernel version. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-09-19 08:50:01 -07:00
Coleman Kane	d76de9fb17	Linux 6.5 compat: blkdev changes Multiple changes to the blkdev API were introduced in Linux 6.5. This includes passing (void* holder) to blkdev_put, adding a new blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type that replaces uses of fmode_t, and removing an argument from the release handler on block_device_operations that we weren't using. The open function definition has also changed to take gendisk* and blk_mode_t, so update it accordingly, too. Implement local wrappers for blkdev_get_by_path() and vdev_blkdev_put() so that the in-line calls are cleaner, and place the conditionally-compiled implementation details inside of both of these local wrappers. Both calls are exclusively used within vdev_disk.c, at this time. Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE The wrapper function is now used for testing using the appropriate method for the kernel, whether the open mode is writable or not. Emphasize fmode_t arg in zvol_release is not used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15099	2023-09-19 08:50:01 -07:00
Coleman Kane	6c2fc56916	Linux 6.5 compat: register_sysctl_table removed Additionally, the .child element of ctl_table has been removed in 6.5. This change adds a new test for the pre-6.5 register_sysctl_table() function, and uses the old code in that case. If it isn't found, then the parentage entries in the tables are removed, and the register_sysctl call is provided the paths of "kernel/spl", "kernel/spl/kmem", and "kernel/spl/kstat" directly, to populate each subdirectory over three calls, as is the new API. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15098	2023-09-19 08:50:01 -07:00
Alexander Motin	e96fbdba34	Add more constraints for block cloning. - We cannot clone into files with smaller block size if there is more than one block, since we can not grow the block size. - Block size must be power-of-2 if destination offset != 0, since there can be no multiple blocks of non-power-of-2 size. The first should handle the case when destination file has several blocks but still is not bigger than one block of the source file. The second fixes panic in dmu_buf_hold_array_by_dnode() on attempt to concatenate files with equal but non-power-of-2 block sizes. While there, assert that error is reported if we made no progress. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.	2023-09-10 14:02:52 -07:00
Volker Mauel	4da8c7d11e	Intel QAT 1.7 compatibility Based on the intel QAT samples which are bundled in the 1.x drivers, this is the preferred approach since api version 1.6. See: https://www.intel.de/content/www/de/de/download/19734/intel-quickassist-technology-driver-for-linux-hw-version-1-x.html? Reviewed-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Volker Mauel <volkermauel@gmail.com> Closes #15190	2023-09-07 16:10:52 -07:00
Alexander Motin	79ac1b29d5	ZIL: Change ZIOs issue order. In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio, we have no right to access lwb->lwb_child_zio. If it was not there, the first two ZIOs may have already completed and freed the lwb. ZIOs issue in opposite order from children to parent should keep the lwb valid till the end, since the lwb can be freed only after lwb_root_zio completion callback. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15233	2023-09-02 10:30:38 -07:00
Alexander Motin	7dc2baaa1f	ZIL: Revert zl_lock scope reduction. While I have no reports of it, I suspect possible use-after-free scenario when zil_commit_waiter() tries to dereference zcw_lwb for lwb already freed by zil_sync(), while zcw_done is not set. Extension of zl_lock scope as it was originally should block zil_sync() from freeing the lwb, closing this race. This reverts #14959 and couple chunks of #14841. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15228	2023-09-02 10:30:38 -07:00
Alexander Motin	5a7cb0b065	ZIL: Tune some assertions. In zil_free_lwb() we should first assert lwb_state or the rest of assertions can be misleading if it is false. Add lwb_state assertions in zil_lwb_add_block() to make sure we are not trying to add elements to lwb_vdev_tree after it was processed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15227	2023-09-02 10:30:38 -07:00
Dimitry Andric	400f56e3f8	dmu_buf_will_clone: change assertion to fix 32-bit compiler warning Building module/zfs/dbuf.c for 32-bit targets can result in a warning: In file included from /usr/src/sys/contrib/openzfs/include/sys/zfs_context.h:97, from /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:32: /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c: In function 'dmu_buf_will_clone': /usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:116:33: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] 116 \| const uint64_t __left = (uint64_t)(LEFT); \ \| ^ /usr/src/sys/contrib/openzfs/lib/libspl/include/assert.h:148:25: note: in expansion of macro 'VERIFY0' 148 \| #define ASSERT0 VERIFY0 \| ^~~~~~~ /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2704:9: note: in expansion of macro 'ASSERT0' 2704 \| ASSERT0(dbuf_find_dirty_eq(db, tx->tx_txg)); \| ^~~~~~~ This is because dbuf_find_dirty_eq() returns a pointer, which if pointers are 32-bit results in a warning about the cast to uint64_t. Instead, use the ASSERT3P() macro, with == and NULL as second and third arguments, which should work regardless of the target's bitness. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #15224	2023-09-01 09:33:33 -07:00
Serapheim Dimitropoulos	ab999406fe	Update outdated assertion from zio_write_compress As part of some internal gang block testing within Delphix we hit the assertion removed by this patch. The assertion was triggered by a ZIO that had two copies and was a gang block making the following expression equal to 3: ``` MIN(zp->zp_copies + BP_IS_GANG(bp), spa_max_replication(spa)) ``` and failing when we expected the above to be equal to `BP_GET_NDVAS(bp)`. The assertion is no longer valid since the following commit: ``` commit `14872aaa4f` Author: Matthew Ahrens <matthew.ahrens@delphix.com> Date: Mon Feb 6 09:37:06 2023 -0800 EIO caused by encryption + recursive gang ``` The above commit changed gang block headers so they can't have more than 2 copies but the assertion in question from this PR was never updated. Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #15180	2023-08-26 11:18:11 -07:00
Rob N	92f095a903	copy_file_range: fix fallback when source create on same txg In `019dea0a5` we removed the conversion from EAGAIN->EXDEV inside zfs_clone_range(), but forgot to add a test for EAGAIN to the copy_file_range() entry points to trigger fallback to a content copy. This commit fixes that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15170 Closes #15172	2023-08-25 13:33:40 -07:00
oromenahar	895cb689d3	zfs_clone_range should return a descriptive error codes Return the more descriptive error codes instead of `EXDEV` when the parameters don't match the requirements of the clone function. Updated the comments in `brt.c` accordingly. The first three errors are just invalid parameters, which zfs can not handle. The fourth error indicates that the block which should be cloned is created and cloned or modified in the same transaction group (`txg`). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15148	2023-08-25 13:33:40 -07:00
Mateusz Piotrowski	c418edf1d3	Fix some typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Closes #15141	2023-08-25 13:33:40 -07:00
Alexander Motin	df8c9f351d	ZIL: Second attempt to reduce scope of zl_issuer_lock. The previous patch #14841 appeared to have a significant flaw, causing deadlocks if the zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 showed cases that were impossible to solve in that design. This patch fixes the problem by postponing log block allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point, though, any sleeps are now allowed, not causing sync thread blockage. This requires a slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between writes, I've found that null zios do not wait for logical children ready status in zio_ready(), which makes the parent write proceed prematurely, producing incorrect log blocks. Adding ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15122	2023-08-25 11:58:44 -07:00
Alexander Motin	bb31ded68b	ZIL: Replay blocks without next block pointer. If we get next block allocation error during log write, we trigger transaction commit. But the block we have just completed is still written and transactions it covers will be acknowledged normally. If after that we ignore the block during replay just because it is the last in the chain, we may not replay some transactions that we have acknowledged as synced, that is not right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15132	2023-08-25 11:58:44 -07:00
Alexander Motin	c1801cbe59	ZIL: Avoid dbuf_read() before dmu_sync(). In most cases dmu_sync() works with dirty records directly and does not need actual data. The only exception is dmu_sync_late_arrival(). To save some CPU time use dmu_buf_hold_noread() in z_get_data() and explicitly call dbuf_read() in dmu_sync_late_arrival(). There is also a chance that by that time TXG will already be synced and we won't have to do it at all. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15153	2023-08-25 11:58:44 -07:00
Alexander Motin	ffaedf0a44	Remove fastwrite mechanism. Fastwrite was introduced many years ago to improve the spread of ZIL writes between multiple top-level vdevs by tracking the number of allocated but not written blocks and choosing the vdev with the smaller count. It was supposed to reduce ZIL knowledge about allocation, but actually made ZIL report its allocations to the allocation code even more actively, complicating both ZIL and metaslab code. On top of that, it seems the ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, which was one of the declared benefits. Plus, the introduction of the embedded log metaslab class solved another problem with the allocation rotor accounting both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify the already too complicated ZIL, ZIO and metaslab code if the benefit of the complexity is not obvious. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15107	2023-08-25 11:58:44 -07:00
Alexander Motin	02ce9030e6	Avoid waiting in dmu_sync_late_arrival(). The transaction there does not produce any dirty data or log blocks, so it should not be throttled. All other cases wait for TXG sync, by which time the log block we are writing will be obsolete, so we can skip waiting and just return error here instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15096	2023-08-25 11:58:44 -07:00
наб	bd1eab16eb	linux: zfs: ctldir: set [amc]time to snapshot's creation property If looking up a snapdir inode failed, hold pool config – hold the snapshot – get its creation property – release it – release it, then use that as the [amc]time in the allocated inode. If that fails then fall back to current time. No performance impact since this is only done when allocating a new snapdir inode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #15110 Closes #15117	2023-08-02 08:53:45 -07:00
Rob N	c47f0f4417	linux/copy_file_range: properly request a fallback copy on Linux <5.3 Before Linux 5.3, the filesystem's copy_file_range handler had to signal back to the kernel that we can't fulfill the request and it should fallback to a content copy. This is done by returning -EOPNOTSUPP. This commit converts the EXDEV return from zfs_clone_range to EOPNOTSUPP, to force the kernel to fallback for all the valid reasons it might be unable to clone. Without it the copy_file_range() syscall will return EXDEV to userspace, breaking its semantics. Add test for copy_file_range fallbacks. copy_file_range should always fallback to a content copy whenever ZFS can't service the request with cloning. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15131	2023-08-02 08:52:40 -07:00
Rob N	12f2b1f65e	zdb: include cloned blocks in block statistics This gives `zdb -b` support for clone blocks. Previously, it didn't know what clones were, so would count their space allocation multiple times and then report leaked space (or, in debug, would assert trying to claim blocks a second time). This commit fixes those bugs, and reports the number of clones and the space "used" (saved) by them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15123	2023-08-02 08:52:40 -07:00
oromenahar	c24a480631	BRT should return EOPNOTSUPP Return the more descriptive EOPNOTSUPP instead of EXDEV when the storage pool doesn't support block cloning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15097	2023-07-27 16:11:54 -07:00
Rob Norris	2768dc04cc	linux: implement filesystem-side copy/clone functions for EL7 Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	3366ceaf3a	linux: implement filesystem-side clone ioctls Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the appropriate VFS op. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	5d12545da8	linux: implement filesystem-side copy/clone functions This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	a3ea8c8ee6	dbuf_sync_leaf: check DB_READ in state assertions Block cloning introduced a new state transition from DB_NOFILL to DB_READ. This occurs when a block is cloned and then read on the current txg. In this case, the clone will move the dbuf to DB_NOFILL, and then the read will be issued for the overridden block pointer. If that read is still outstanding when it comes time to write, the dbuf will be in DB_READ, which is not handled by the checks in dbuf_sync_leaf, thus tripping the assertions. This updates those checks to allow DB_READ as a valid state iff the dirty record is for a BRT write and there is an override block pointer. This is a safe situation because the block already exists, so there's nothing that could change from underneath the read. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Original-patch-by: Kay Pedersen <mail@mkwg.de> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	0426e13271	dmu_buf_will_clone: only check that current txg is clean dbuf_undirty() will (correctly) only remove dirty records for the given (open) txg. If there is a dirty record for an earlier closed txg that has not been synced out yet, then db_dirty_records will still have entries on it, tripping the assertion. Instead, change the assertion to only consider the current txg. To some extent this is redundant, as it's really just saying "did dbuf_undirty() work?", but it doesn't hurt and accurately expresses our expectations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Original-patch-by: Kay Pedersen <mail@mkwg.de> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	8aa4f0f0fc	brt_vdev_realloc: use vmem_alloc for large allocation bv_entcount can be a relatively large allocation (see comment for BRT_RANGESIZE), so get it from the big allocator. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Rob Norris	7698503dca	zfs_clone_range: use vmem_malloc for large allocation Just silencing the warning about large allocations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-26 08:46:58 -07:00
Alexander Motin	991834f5dc	Remove zl_issuer_lock from zil_suspend(). This locking was recently added as part of #14979. But appears it is illegal to take zl_issuer_lock while holding dp_config_rwlock, taken by dsl_pool_hold(). It causes deadlock with sync thread in spa_sync_upgrades(). On a second thought, we should not need this locking, since zil_commit_impl() we call below takes zl_issuer_lock, that should sufficiently protect zl_suspend reads, combined with other logic from #14979. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15103	2023-07-25 13:54:02 -07:00
Alexander Motin	41a0f66279	ZIL: Fix config lock deadlock. When we have some LWBs closed and their ZIOs ready to be issued, we can not afford sleeping on config lock if somebody else try to lock it as writer, or it will cause a deadlock. To solve it, move spa_config_enter() from zil_lwb_write_issue() to zil_lwb_write_close() under zl_issuer_lock to enforce lock ordering with other threads. Now if we can't immediately lock config, issue all previously closed LWBs so that they could drop their config locks after completion, and only then allow sleeping on our lock. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15078 Closes #15080	2023-07-25 13:54:02 -07:00
Rob N	685ae4429f	metaslab: tuneable to better control force ganging metaslab_force_ganging isn't enough to actually force ganging, because it still only forces 3% of the time. This adds metaslab_force_ganging_pct so we can configure how often to force ganging. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15088	2023-07-21 16:35:12 -07:00
Alexander Motin	81be809a25	Adjust prefetch parameters. - Reduce maximum prefetch distance for 32bit platforms to 8MB as it was previously. Those systems didn't grow much probably, so better stay conservative there. - Retire array_rd_sz tunable, blocking prefetch for large requests. We should not penalize applications trying to be more efficient. The speculative prefetcher by itself has reasonable distance limits, and 1MB is not much at all these days. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15072	2023-07-21 16:35:12 -07:00
Alexander Motin	8a6fde8213	Add explicit prefetches to bpobj_iterate(). To simplify error handling bpobj_iterate_blkptrs() iterates through the list of block pointers backwards. Unfortunately speculative prefetcher is currently unable to detect such patterns, that makes each block read there synchronous and very slow on HDD pools. According to my tests, added explicit prefetch reduces time needed to asynchronously delete 8 snapshots of 4 million blocks each from 20 seconds to less than one, that should free sync thread for other useful work, such as async writes, scrub, etc. While there, plug one memory leak in case of bpobj_open() error and harmonize some variable names. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15071	2023-07-21 16:35:12 -07:00
Alan Somers	b6f618f8ff	Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes #14717 for mirrored pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #14717 Closes #15052	2023-07-21 16:35:12 -07:00
Alan Somers	51a2b59767	Don't emit checksum histograms in ereport.fs.zfs.checksum events The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15052	2023-07-21 16:35:12 -07:00
Chunwei Chen	b221f43943	Fix zpl_test_super race with zfs_umount We cannot call zpl_enter in zpl_test_super, because zpl_test_super is under spinlock so we can't sleep, and also because zpl_test_super is called without sb->s_umount taken, so it's possible we would race with zfs_umount and call zpl_enter on freed zfsvfs. Here's an stack trace when this happens: [ 2379.114837] VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.114845] PANIC at spl-condvar.c:497:__cv_broadcast() [ 2379.114854] Kernel panic - not syncing: VERIFY(cvp->cv_magic == CV_MAGIC) failed [ 2379.115012] Call Trace: [ 2379.115019] dump_stack+0x74/0x96 [ 2379.115024] panic+0x114/0x2f6 [ 2379.115035] spl_panic+0xcf/0xfc [spl] [ 2379.115477] __cv_broadcast+0x68/0xa0 [spl] [ 2379.115585] rrw_exit+0xb8/0x310 [zfs] [ 2379.115696] rrm_exit+0x4a/0x80 [zfs] [ 2379.115808] zpl_test_super+0xa9/0xd0 [zfs] [ 2379.115920] sget+0xd1/0x230 [ 2379.116033] zpl_mount+0xdc/0x230 [zfs] [ 2379.116037] legacy_get_tree+0x28/0x50 [ 2379.116039] vfs_get_tree+0x27/0xc0 [ 2379.116045] path_mount+0x2fe/0xa70 [ 2379.116048] do_mount+0x80/0xa0 [ 2379.116050] __x64_sys_mount+0x8b/0xe0 [ 2379.116052] do_syscall_64+0x35/0x50 [ 2379.116054] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 2379.116057] RIP: 0033:0x7f9912e8b26a Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #15077	2023-07-21 16:35:12 -07:00
Ameer Hamza	e037327bfe	spa_min_alloc should be GCD, not min Since spa_min_alloc may not be a power of 2, unlike ashifts, in the case of DRAID, we should not select the minimal value among several vdevs. Rounding to a multiple of it is unlikely to work for other vdevs. Instead, using the greatest common divisor produces smaller yet more reasonable results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15067	2023-07-21 16:35:12 -07:00
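The reasoning in `e037327bfe` can be illustrated with a minimal C sketch. It is not the code from the commit: `gcd_u64()` and `pool_min_alloc_gcd()` are hypothetical helper names, and the sizes in the usage note are made-up values chosen only to show why the minimum is a poor choice when the per-vdev values are not powers of 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(x, 0) returns x. */
static uint64_t
gcd_u64(uint64_t a, uint64_t b)
{
	while (b != 0) {
		uint64_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/*
 * Hypothetical helper: fold the minimum allocation sizes of n vdevs
 * into one pool-wide value that every per-vdev size is a multiple of.
 */
static uint64_t
pool_min_alloc_gcd(const uint64_t *min_alloc, size_t n)
{
	assert(min_alloc != NULL && n > 0);

	uint64_t g = min_alloc[0];
	for (size_t i = 1; i < n; i++)
		g = gcd_u64(g, min_alloc[i]);
	return (g);
}
```

With illustrative sizes of 12288 and 20480 bytes, taking the minimum would give 12288, which 20480 is not a multiple of; the GCD gives 4096, which divides both.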
Yuri Pankov	1a2e486d25	Don't panic if setting vdev properties is unsupported for this vdev type Check that vdev has valid zap and bail out early. While here, move objid selection out of the loop, it's not going to change. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15063	2023-07-21 16:35:12 -07:00
Ameer Hamza	d8011707cc	Ignore pool ashift property during vdev attachment Ashift can be set for a vdev only during its creation, and the top-level vdev does not change when a vdev is attached or replaced. The ashift property should not be used during attachment, as it does not allow attaching/replacing a vdev if the pool's ashift property is increased after the existing vdev was created. Instead, we should be able to attach the vdev if the attached vdev can satisfy the ashift requirement with its parent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15061	2023-07-21 16:35:12 -07:00
Alexander Motin	83b0967c1f	Do not request data L1 buffers on scan prefetch. Set ARC_FLAG_NO_BUF when prefetching data L1 buffers for scan. We do not prefetch data L0 buffers, so we do not need the L1 buffers, only want them to be ready in ARC. This saves some CPU time on the buffers decompression. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15029	2023-07-21 16:35:12 -07:00
Yuri Pankov	5299f4f289	set autotrim default to 'off' everywhere As it turns out having autotrim default to 'on' on FreeBSD never really worked due to mess with defines where userland and kernel module were getting different default values (userland was defaulting to 'off', module was thinking it's 'on'). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15079	2023-07-21 16:35:12 -07:00
Alan Somers	f917cf1c03	Fix the ZFS checksum error histograms with larger record sizes My analysis in PR #14716 was incorrect. Each histogram bucket contains the number of incorrect bits, by position in a 64-bit word, over the entire record. 8-bit buckets can overflow for record sizes above 2k. To forestall that, saturate each bucket at 255. That should still get the point across: either all bits are equally wrong, or just a couple are. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15049	2023-07-21 16:35:12 -07:00
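The saturation described in `f917cf1c03` can be sketched in C as follows. This is an illustration under assumptions, not the code from the commit: the function name `bit_error_histogram()` and its signature are hypothetical. It counts wrong bits by position within each 64-bit word and stops each 8-bit bucket at 255 instead of letting it wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: compare a known-good buffer with a bad one,
 * word by word, and count wrong bits per bit position (0..63).
 * Each bucket saturates at UINT8_MAX rather than overflowing.
 */
static void
bit_error_histogram(const uint64_t *good, const uint64_t *bad,
    size_t nwords, uint8_t hist[64])
{
	assert(good != NULL && bad != NULL && hist != NULL);

	memset(hist, 0, 64);
	for (size_t w = 0; w < nwords; w++) {
		uint64_t diff = good[w] ^ bad[w];	/* set bits differ */

		for (int bit = 0; bit < 64; bit++) {
			if (((diff >> bit) & 1) && hist[bit] < UINT8_MAX)
				hist[bit]++;
		}
	}
}
```

A 2 KiB record holds 256 such words, so a bit position that is wrong in every word already exceeds what an unsaturated 8-bit bucket can hold.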
Alexander Motin	56ed389a57	Fix raw receive with different indirect block size. Unlike regular receive, raw receive requires the destination to have the same block structure as the source. In case of dnode reclaim this triggers two special cases, requiring special handling: - If dn_nlevels == 1, we can change the ibs, but dnode_set_blksz() should not dirty the data buffer if block size does not change, or during receive dbuf_dirty_lightweight() will trigger assertion. - If dn_nlevels > 1, we just can't change the ibs, dnode_set_blksz() would fail and receive_object would trigger assertion, so we should destroy and recreate the dnode from scratch. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15039 (cherry picked from commit `c4e8742149`)	2023-07-20 08:58:29 -07:00
Alexander Motin	e613e4bbe3	Avoid extra snprintf() in dsl_deadlist_merge(). Since we are already iterating the ZAP, we have exact string key to remove, we do not need to call zap_remove_int() with the int key we just converted, we can call zap_remove() for the original string. This should make no functional change, only a micro-optimization. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15056 (cherry picked from commit `fdba8cbb79`)	2023-07-20 08:58:29 -07:00
Alexander Motin	b4e630b00c	Add missed DMU_PROJECTUSED_OBJECT prefetch. It seems `9c5167d19f` "Project Quota on ZFS" did not add a prefetch for DMU_PROJECTUSED_OBJECT during scan (scrub/resilver). It should not cause visible problems, but may affect scrub/resilver performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15024	2023-07-20 08:58:29 -07:00
Alexander Motin	1266cebf87	FreeBSD: Fix build on stable/13 after 1302506. Starting approximately from version 1302506 vn_lock_pair() has grown two additional arguments following head. There is a one week hole, but that is the closest reference point we have. Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15047	2023-07-20 08:58:29 -07:00
Prakash Surya	945e39fc3a	Enable tuning of ZVOL open timeout value The default timeout for ZVOL opens may not be sufficient for all cases, so we should enable the value to be more easily tuned to account for systems where the default value is insufficient. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #15023	2023-06-30 11:34:05 -07:00
Rich Ercolani	2b10e32561	Pack our DDT ZAPs a bit denser. The DDT is really inefficient on 4k and up vdevs, because it always allocates 4k blocks, and while compression could save us somewhat at ashift 9, that stops being true. So let's change the default to 32 KiB, which seems like a reasonable compromise between improved space savings and inflated write sizes for DDT updates. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14654	2023-06-30 09:42:02 -07:00
Rob N	61ab05cac7	ddt_addref: remove unnecessary phys fill when refcount is 0 The previous comment wondered if this case could happen; it turns out that it really can't. This block can only be entered if dde_type and dde_class are "real"; that only happens when a ddt entry has been previously synced to a ddt store, that is, it was created on a previous txg. Since it's gone through that sync, its dde_refcount must be >0. ddt_addref() is called from brt_pending_apply(), which is called at the beginning of spa_sync(), before pending DMU writes/frees are issued. Freeing a dedup block is the only thing that can decrement dde_refcount, so there's no way for it to drop to zero before applying the clone bumps it. Further, even if it _could_ go to zero, it wouldn't be necessary to fill the entry from the block. The phys content is not cleared until the free is issued, which happens when the refcount goes to zero, when the last real free comes through. The cloned block should be identical to what's in the phys already, so the fill should be a no-op anyway. I've replaced this with an assertion because this is all very dependent on the ordering in which BRT and DDT changes are applied, and that might change in the future. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: Klara, Inc. Closes #15004	2023-06-30 09:01:58 -07:00
Alexander Motin	233425a153	Again fix race between zil_commit() and zil_suspend(). With zl_suspend read in zil_commit() not protected by any locks it is possible for new ZIL writes to be in progress while zil_destroy() called by zil_suspend() freeing them. This patch closes the race by taking zl_issuer_lock in zil_suspend() and adding the second zl_suspend check to zil_get_commit_list(), protected by the lock. It allows all already queued transactions to be logged normally, while blocks any new ones, calling txg_wait_synced() for the TXGs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14979	2023-06-30 08:59:39 -07:00
Alexander Motin	b4a0873092	Some ZIO micro-optimizations. - Pack struct zio_prop by 4 bytes from 84 to 80. - Skip new child ZIO locking while linking to parent. The newly allocated ZIO is not externally visible yet, so nobody should care. - Skip io_bp_copy writes when not used (write && non-debug). Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14985	2023-06-30 08:54:00 -07:00
Alexander Motin	fa7b2390d4	Do not report bytes skipped by scan as issued. Scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, that caused reporting of hugely over-inflated numbers, having nothing to do with actual disk I/O. This change utilizes never used field in struct dsl_scan_phys to account such skipped bytes, allowing to report how much data were actually scrubbed/resilvered and what is the actual I/O speed. While formally it is an on-disk format change, it should be compatible both ways, so should not need a feature flag. This should partially address the same issue as `c85ac731a0`, but from a different perspective, complementing it. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15007	2023-06-30 08:47:13 -07:00
Alexander Motin	a9d6b0690b	ZIL: Fix another use-after-free. lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that. This is similar to the `55b1842f92`, but as I see the bug is not new. It existed for quite a while, just was not triggered due to smaller race window. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14988 Closes #14999	2023-06-27 17:03:37 -07:00
Alexander Motin	b0cbc1aa9a	Use big transactions for small recordsize writes. When ZFS appends files in chunks bigger than recordsize, it borrows buffer from ARC and fills it before opening transaction. This is supposed to help in case of page faults to not hold transaction open indefinitely. The problem appears when recordsize is set lower than default 128KB. Since each block is committed in separate transaction, per-transaction overhead becomes significant, and what is even worse, active use of per-dataset and per-pool locks to protect space use accounting for each transaction badly hurts the code SMP scalability. The same transaction size limitation applies in case of file rewrite, but without even excuse of buffer borrowing. To address the issue, disable the borrowing mechanism if recordsize is smaller than default and the write request is 4x bigger than it. In such case writes up to 32MB are executed in single transaction, that dramatically reduces overhead and lock contention. Since the borrowing mechanism is not used for file rewrites, and it was never used by zvols, which seem to work fine, I don't think this change should create significant problems, partially because in addition to the borrowing mechanism there are also used pre-faults. My tests with 4/8 threads writing several files same time on datasets with 32KB recordsize in 1MB requests show reduction of CPU usage by the user threads by 25-35%. I would measure it in GB/s, but at that block size we are now limited by the lock contention of single write issue taskqueue, which is a separate problem we are going to work on. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14964	2023-06-27 17:00:30 -07:00
Alexander Motin	8469b5aac0	Another set of vdev queue optimizations. Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio. To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue(). Introduce deadline of ~0.5s for LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to slightly more complicate the comparison function, but since it uses all the same cache lines the difference is minimal. For a sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses more simple avl_first(), falling back to avl_find() and avl_nearest() only when needed. Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used same time. Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasted time maintaining the offset-sorted tree of TRIM requests. Just remove the tree. Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pair of locks/unlocks take noticeable time under the vdev queue lock. With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14925	2023-06-27 09:09:48 -07:00
Rich Ercolani	35a6247c5f	Add a delay to tearing down threads. It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again... I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destroy was a bunch of time... So I added a configurable delay to avoid it tearing down tasks the first time it notices them idle, and the total number of threads at steady state went up, but the amount of time being burned just tearing down/turning up new ones almost vanished. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14938	2023-06-26 13:57:12 -07:00
Alexander Motin	8e8acabdca	Fix memory leak in zil_parse(). `482da24e2` missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14987	2023-06-17 19:51:37 -07:00
Alexander Motin	ccec7fbe1c	Remove ARC/ZIO physdone callbacks. Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of physical writes complete much sooner than others, gradually opening the write throttle. Few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing. While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention. My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. Same time attempts to measure any write time fluctuations didn't show anything noticeable. While there, remove also io_child_count/io_parent_count counters. They are used only for couple assertions that can be avoided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14948	2023-06-15 10:49:03 -07:00
Alexander Motin	d057807ede	Switch refcount tracking from lists to AVL-trees. With large number of tracked references list searches under the lock become too expensive, creating enormous lock contention. On my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14970	2023-06-14 08:02:27 -07:00
George Amanakis	8af1104f83	Store the L2ARC device ashift in the vdev label If this is not done, and the pool has an ashift other than the default (at the moment 9) then the following happens: 1) vdev_alloc() assigns the ashift of the pool to L2ARC device, but upon export it is not stored anywhere 2) at the first import, vdev_open() sees a vdev_ashift() of 0 and assigns the logical_ashift, which is 9 3) reading the contents of L2ARC, including the header fails 4) L2ARC buffers are not restored in ARC. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14313 Closes #14963	2023-06-14 08:01:17 -07:00
George Amanakis	feff9dfed3	Fix the L2ARC write size calculating logic (2) While commit `bcd5321` adjusts the write size based on the size of the log block, this happens after comparing the unadjusted write size to the evicted (target) size. In this case l2ad_hand will exceed l2ad_evict and violate an assertion at the end of l2arc_write_buffers(). Fix this by adding the max log block size to the allocated size of the buffer to be committed before comparing the result to the target size. Also reset the l2arc_trim_ahead ZFS module variable when the adjusted write size exceeds the size of the L2ARC device. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14936 Closes #14954	2023-06-09 17:05:47 -07:00
Alexander Motin	70ea484e3e	Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14953	2023-06-09 12:40:55 -07:00
Alexander Motin	b3ad3f48d9	Use list_remove_head() where possible. ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14955	2023-06-09 10:12:52 -07:00
Alexander Motin	55b1842f92	ZIL: Fix race introduced by `f63811f072`. We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed by zil_sync(). To free itxs and waiters after dropping the lock we need to move lwb_itxs and lwb_waiters lists elements to local storage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14957 Closes #14959	2023-06-09 10:08:05 -07:00
Brian Behlendorf	93f8abeff0	Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926 ) When a kmem cache is exhausted and needs to be expanded a new slab is allocated. KM_SLEEP callers can block and wait for the allocation, but KM_NOSLEEP callers were incorrectly allowed to block as well. Resolve this by attempting an emergency allocation as a best effort. This may fail but that's fine since any KM_NOSLEEP consumer is required to handle an allocation failure. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2023-06-07 10:43:43 -07:00
George Amanakis	bcd5321039	Fix the L2ARC write size calculating logic l2arc_write_size() should return the write size after adjusting for trim and overhead of the L2ARC log blocks. Also take into account the allocated size of log blocks when deciding when to stop writing buffers to L2ARC. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14939	2023-06-06 12:32:37 -07:00
Rob Norris	8653f1de48	zdb: add -B option to generate backup stream This is more-or-less like `zfs send`, but specifying the snapshot by its objset id for situations where it can't be referenced any other way. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:54:42 -07:00
Rob Norris	2b9f8ba673	znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:53:44 -07:00
Alexander Motin	5ba4025a8d	Introduce zfs_refcount_(add\|remove)_few(). There are two places where we need to add/remove several references with semantics of zfs_refcount_(add\|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() in a loop, especially under congested pool-wide allocator lock. Introduced new functions implement the same semantics as the loop, but without overhead in production builds. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14934	2023-06-05 11:51:44 -07:00
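The idea behind `5ba4025a8d` can be sketched with C11 atomics. This is an assumption-laden illustration, not the OpenZFS implementation: `sketch_refcount_t` and both functions are hypothetical names, and the sketch covers only the non-debug case where a reference count is a bare counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reference count without debug tracking. */
typedef struct sketch_refcount {
	_Atomic uint64_t rc_count;
} sketch_refcount_t;

/* The pattern being avoided: n separate atomic increments. */
static uint64_t
sketch_refcount_add_loop(sketch_refcount_t *rc, uint64_t n)
{
	assert(rc != NULL);

	for (uint64_t i = 0; i < n; i++)
		atomic_fetch_add(&rc->rc_count, 1);
	return (atomic_load(&rc->rc_count));
}

/* Same resulting count, but a single atomic read-modify-write. */
static uint64_t
sketch_refcount_add_few(sketch_refcount_t *rc, uint64_t n)
{
	assert(rc != NULL);

	/* atomic_fetch_add() returns the value before the addition. */
	return (atomic_fetch_add(&rc->rc_count, n) + n);
}
```

Both functions leave the counter in the same state; the second touches the shared cache line once instead of n times, which matters most when the caller already holds a contended lock.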
Alexander Motin	482da24e20	ZIL: Allow to replay blocks of any size. There seems to be no reason for ZIL blocks to be limited by 128KB other than replay code is written in such a way. This change does not increase the limit yet, just removes the artificial limitation. Avoided extra memcpy() may save us a second during replay. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14910	2023-06-02 11:01:58 -07:00
Luís Henriques	928c81f4df	Fix NULL pointer dereference when doing concurrent 'send' operations A NULL pointer will occur when doing a 'zfs send -S' on a dataset that is still being received. The problem is that the new 'send' will rightfully fail to own the datasets (i.e. dsl_dataset_own_force() will fail), but then dmu_send() will still do the dsl_dataset_disown(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Luís Henriques <henrix@camandro.org> Closes #14903 Closes #14890	2023-05-30 15:15:24 -07:00
Richard Yao	677c6f8457	btree: Implement faster binary search algorithm This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improvement with Clang and 8% improvement with GCC. Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm. Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14866	2023-05-26 10:03:12 -07:00
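As a rough illustration of the kind of search `677c6f8457` describes, here is a generic low-branch lower-bound search over a sorted `uint64_t` array. It is a well-known technique shown under assumptions, not the B-Tree code from the commit: `lower_bound_u64()` is a hypothetical name, the comparator is hard-coded rather than inlined through a macro, and no unrolling hint is given. Whether the compiler actually emits a conditional move for the selection is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the index of the first element that is
 * not less than key, or n if every element is less than key. The loop
 * trip count depends only on n; the comparison result selects a
 * pointer instead of choosing between two code paths.
 */
static size_t
lower_bound_u64(const uint64_t *a, size_t n, uint64_t key)
{
	const uint64_t *base = a;

	if (n == 0)
		return (0);
	assert(a != NULL);

	while (n > 1) {
		size_t half = n / 2;

		/* Skip the lower half when its last element is < key. */
		base = (base[half - 1] < key) ? base + half : base;
		n -= half;
	}
	/* One element left: step past it if it is still < key. */
	return ((size_t)(base - a) + (base[0] < key));
}
```

The only data-dependent decision left in each iteration is the pointer selection, so the remaining cost is dominated by the memory accesses.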
George Amanakis	bb736d98d1	Fix inconsistent definition of zfs_scrub_error_blocks_per_txg Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14894	2023-05-26 09:53:00 -07:00
Alexander Motin	b6fbe61fa6	zil: Add some more statistics. In addition to a number of actual log bytes written, account also a total written bytes including padding and total allocated bytes (bytes <= write <= alloc). It should allow to monitor zil traffic and space efficiency. Add dtrace probe for zil block size selection. Make zilstat report more information and fit it into less width. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14863	2023-05-25 13:51:53 -07:00
Alexander Motin	f63811f072	ZIL: Reduce scope of per-dataset zl_issuer_lock. Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock, then, after the lock is dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os. Also while there slightly reduce scope of zl_lock. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14841	2023-05-25 09:48:43 -07:00
Akash B	9d618615d1	Fix concurrent resilvers initiated at same time For draid vdevs it was possible to initiate both the sequential and healing resilver at same time. This fixes the following two scenarios. 1) There's a window where a sequential rebuild can be started via ZED even if a healing resilver has been scheduled. - This is fixed by adding additional check in spa_vdev_attach() for any scheduled resilver and return appropriate error code when a resilver is already in progress. 2) It was possible for zpool clear to start a healing resilver when it wasn't needed at all. This occurs because during a vdev_open() the device is presumed to be healthy not until the device is validated by vdev_validate() and it's set unavailable. However, by this point an async resilver will have already been requested if the DTL isn't empty. - This is fixed by cancelling the SPA_ASYNC_RESILVER request immediately at the end of vdev_reopen() when a resilver is unneeded. Finally, added a testcase in ZTS for verification. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #14881 Closes #14892	2023-05-24 12:28:09 -07:00
youzhongyang	f8447cf22e	Linux 6.4 compat: reclaimed_slab renamed to reclaimed Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14891	2023-05-24 12:23:42 -07:00
Brian Atkinson	ad0a554614	Hold db_mtx when updating db_state Commit `555ef90` did some general code refactoring for dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx was not held when updating db->db_state in those code blocks. The rest of the dbuf code always holds the db_mtx when updating db_state. This is important because cv_wait() db_changed is used to check for db_state changes. Updating dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the db_mtx when updating db_state. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #14875	2023-05-19 13:05:53 -07:00
Brian Behlendorf	577e835f30	Probe vdevs before marking removed Before allowing the ZED to mark a vdev as REMOVED due to a hotplug event confirm that it is non-responsive with probe. Any device which can be successfully probed should be left ONLINE to prevent a healthy pool from being incorrectly SUSPENDED. This may occur for at least the following two scenarios. 1) Drive expansion (zpool online -e) in VMware environments. If, during the partition resize operation, a partition is removed and re-created then udev will send a removed event. 2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan) may result in a udev remove and add event being delivered. Finally, update the ZED to only kick in a spare when the removal was successful. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14859 Closes #14861	2023-05-19 13:05:09 -07:00
George Amanakis	482eeef804	Teach zpool scrub to scrub only blocks in error log Added a flag '-e' in zpool scrub to scrub only blocks in error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed in multiple txg to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain tulsi.jain@delphix.com Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #8995 Closes #12355	2023-05-18 11:59:42 -07:00
Brian Behlendorf	e34e15ed6d	Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12451 Closes #14873	2023-05-18 10:02:20 -07:00
Richard Yao	ee7b71dbc9	Fix undefined behavior in spa_sync_props() `8eae2d214c` caused Coverity to begin complaining about "Improper use of negative value" in two places in spa_sync_props() because Coverity correctly inferred from `prop == ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name() and zpool_prop_get_type() use it as an array index, which is undefined behavior. Assuming that the system does not panic from an attempt to read invalid memory, the case statement for ZPOOL_PROP_INVAL will ensure that only user properties will reach this code when prop is ZPOOL_PROP_INVAL, such that execution will continue safely. However, if we are unlucky enough to read invalid memory, then the system will panic. This issue predates the patch that caused Coverity to begin complaining. Thankfully, our userland tools do not pass nonsense to us, so this bug should not be triggered unless a future userland tool attempts to set a property that we do not understand. Reported-by: Coverity (CID-1561129) Reported-by: Coverity (CID-1561130) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14860	2023-05-15 10:29:05 -07:00
Richard Yao	c87798d8ff	Fix use after free regression in spa_remove_healed_errors() `6839ec6f10` placed code in spa_remove_healed_errors() that uses a pointer after the kmem_free() call that frees it. Reported-by: Coverity (CID-1562375) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14860	2023-05-15 10:29:01 -07:00
Alexander Motin	7381ddf1ab	zil: Free lwb_buf after write completion. There is no sense to keep that memory allocated during the flush. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14855	2023-05-12 09:49:26 -07:00
Alexander Motin	895e03135e	zil: Some micro-optimizations. Should not cause functional changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14854	2023-05-12 09:14:29 -07:00
Pawel Jakub Dawidek	e610766838	Make sure we are not trying to clone a spill block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:07:15 -07:00
Pawel Jakub Dawidek	fbbe5e96ef	Correct comment. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:07:11 -07:00
Pawel Jakub Dawidek	9879930f7a	Remove badly placed comment. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:07:07 -07:00
Pawel Jakub Dawidek	b6d7370b9d	Don't call zfs_exit_two() before zfs_enter_two(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:07:02 -07:00

4303 Commits