Commit Graph

3501 Commits

Author SHA1 Message Date
Alexander Motin
a41ef36858 DDT: Reduce global DDT lock scope during writes
Before this change the DDT lock was taken 4 times per written block,
and, being effectively a pool-wide lock, it could get highly congested.
This change introduces a new per-entry dde_io_lock, protecting some
fields during the I/O ready and done stages, so that we don't need the
global lock there.

According to my write tests on a 64-thread system with 4KB blocks this
significantly reduces the global lock contention, lowering CPU usage
from 100% to the expected ~80% and increasing write throughput by 10%.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17960
2025-12-10 10:21:29 -08:00
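
A minimal userspace sketch of the locking pattern described in the commit above: fields touched during the write ready/done stages move under a per-entry lock, so the pool-wide lock is only needed for table lookups. All type and field names here are invented for illustration, not the actual OpenZFS structures.

    #include <pthread.h>
    #include <stdint.h>

    /* Illustrative per-entry lock protecting I/O-stage fields. */
    typedef struct dedup_entry {
            pthread_mutex_t de_io_lock;     /* role of the new dde_io_lock */
            uint64_t        de_refcount;    /* fields updated at ready/done */
            uint64_t        de_phys_birth;
    } dedup_entry_t;

    typedef struct dedup_table {
            pthread_mutex_t dt_lock;        /* role of the pool-wide DDT lock */
            /* ... lookup tree keyed by block checksum ... */
    } dedup_table_t;

    /* Write done stage: only the entry's own lock is taken. */
    static void
    entry_io_done(dedup_entry_t *de, uint64_t birth)
    {
            pthread_mutex_lock(&de->de_io_lock);
            de->de_refcount++;
            de->de_phys_birth = birth;
            pthread_mutex_unlock(&de->de_io_lock);
    }

    /* Lookups and insertions still serialize on the table-wide lock. */
    static dedup_entry_t *
    table_lookup(dedup_table_t *dt)
    {
            pthread_mutex_lock(&dt->dt_lock);
            dedup_entry_t *de = NULL;       /* ... find or insert entry ... */
            pthread_mutex_unlock(&dt->dt_lock);
            return (de);
    }
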
Alexander Motin
a785ddc5f3 DDT: Switch to using wmsums for lookup stats
ddt_lookup() is very busy code running under a highly congested global
lock.  Anything we can save here is very important.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17980
2025-12-10 10:21:29 -08:00
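
The wmsum idea is to turn hot-path statistics bumps into write-mostly counters that scale with CPUs instead of serializing on the congested lock or on a single shared atomic. A rough userspace analogue using hashed buckets; the actual wmsum implementation in OpenZFS differs.

    #include <stdatomic.h>
    #include <stdint.h>

    #define WMSUM_BUCKETS   64

    /* Writes spread across buckets; reads sum them and may be slow. */
    typedef struct wmsum_sketch {
            _Atomic uint64_t ws_bucket[WMSUM_BUCKETS];
    } wmsum_sketch_t;

    static void
    wmsum_sketch_add(wmsum_sketch_t *ws, unsigned cpu, uint64_t delta)
    {
            atomic_fetch_add_explicit(&ws->ws_bucket[cpu % WMSUM_BUCKETS],
                delta, memory_order_relaxed);
    }

    static uint64_t
    wmsum_sketch_value(wmsum_sketch_t *ws)
    {
            uint64_t sum = 0;
            for (unsigned i = 0; i < WMSUM_BUCKETS; i++)
                    sum += atomic_load_explicit(&ws->ws_bucket[i],
                        memory_order_relaxed);
            return (sum);
    }
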
Alexander Motin
2aad3dee23 DDT: Make children writes inherit allocator
Even though, unlike gang children, it is not as critical for dedup
children to inherit the parent's allocator, there is still no reason
for them to have an allocation policy different from normal writes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17961
2025-12-10 10:21:29 -08:00
Alexx Saver
f45622ff42 chksum: run 256K benchmark on demand, preserve chksum_stat_data
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexx Saver <lzsaver.eth@ethermail.io>
Co-authored-by: Adam Moss <c@yotes.com>
Closes #17945
Closes #17946
2025-12-10 10:21:29 -08:00
Mariusz Zaborski
1e8c96d7d5 Add knob to disable slow io notifications
Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that
allows users to disable notifications for slow devices.
This prevents ZED and/or ZFSD from degrading the pool due to slow
I/O.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes #17477
2025-11-12 13:07:14 -08:00
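
Conceptually the new property is a per-vdev gate in front of the slow-I/O event that ZED/zfsd act on. A hypothetical sketch; the field and function names are invented, and the real plumbing goes through the vdev property code.

    #include <stdbool.h>

    struct vdev_sketch {
            bool    vd_slow_io_reporting;   /* VDEV_PROP_SLOW_IO_REPORTING */
    };

    /* Called when an I/O exceeds the slow-I/O threshold. */
    static void
    report_slow_io(struct vdev_sketch *vd)
    {
            if (!vd->vd_slow_io_reporting)
                    return;         /* notifications disabled by the user */
            /* post the slow-I/O zevent so ZED/zfsd can react */
    }
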
Alexander Motin
41878d57ea Add BRT support to zpool prefetch command
Implement BRT (Block Reference Table) prefetch functionality similar
to the existing DDT prefetch.  This allows preloading BRT metadata into
the ARC to improve performance for block cloning operations and for
frees of previously cloned blocks.

Make the -t parameter optional.  When omitted, prefetch all supported
metadata types (currently both DDT and BRT).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17890
2025-11-12 13:07:09 -08:00
Alexander Motin
002bc3da6a BRT: Increase block size from 4KB to 8KB
According to my observations, BRT ZAPs are typically compressible
3:1 for data and 2:1 for indirect blocks.  With ashift=12, typical
these days, increasing the block size to 8KB captures most of the
possible compression, cutting the on-disk and in-ARC BRT footprint
roughly in half at the cost of some compression/decompression
overhead, but without real write inflation, only some increase in
dirty data.

Increasing to 32KB, as done for DDT, could further improve compression
and storage efficiency, but at the cost of write inflation and a much
bigger dirty data increase, which we cannot properly control now.  So
let's leave that for when the BRT log gets implemented.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17916
2025-11-12 13:07:04 -08:00
Alexander Motin
e895c76194 ZAP: Remove dmu_object_info_from_dnode() call
dmu_object_info_from_dnode() takes two locks and copies plenty of
data that we don't need in zap_lockdir_impl().  Just read dn_type
directly in this hot path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17921
2025-11-12 13:07:00 -08:00
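
The change boils down to replacing a heavyweight "copy all object info" call with a direct read of the one field the hot path needs. A schematic userspace illustration with invented type names; the real code reads dn_type in zap_lockdir_impl().

    #include <stdint.h>
    #include <string.h>

    typedef struct dnode_sketch {
            uint8_t dn_type;        /* the only field the ZAP path needs */
            /* ... locks and many other fields ... */
    } dnode_sketch_t;

    typedef struct object_info_sketch {
            uint8_t         doi_type;
            uint64_t        doi_max_offset;
            /* ... plenty of data the ZAP path never looks at ... */
    } object_info_sketch_t;

    /* Before: take locks, fill the whole info struct, use one byte of it. */
    static uint8_t
    object_type_slow(const dnode_sketch_t *dn)
    {
            object_info_sketch_t doi;
            memset(&doi, 0, sizeof (doi));  /* stands in for the lock+copy */
            doi.doi_type = dn->dn_type;
            return (doi.doi_type);
    }

    /* After: just read the field directly. */
    static uint8_t
    object_type_fast(const dnode_sketch_t *dn)
    {
            return (dn->dn_type);
    }
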
Rob Norris
ac0bc4cc00 spa_misc: add an API for spa_namespace_lock
This is useful as debugging support, as it lets namespace lock
operations be traced directly. It will also be useful for future work to
reduce the use of spa_namespace_lock, traditionally a source of
difficult deadlocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17906
2025-11-12 13:06:54 -08:00
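
Funneling every acquisition of a global lock through small enter/exit helpers gives one place to hang tracing, assertions, and future refactoring. A userspace sketch of the idea; the helper names are invented and may not match the API this commit adds.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t namespace_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

    /* Every acquisition goes through here, so it can be traced uniformly. */
    static void
    namespace_enter_sketch(const char *tag)
    {
            fprintf(stderr, "namespace enter: %s\n", tag);
            pthread_mutex_lock(&namespace_lock_sketch);
    }

    static void
    namespace_exit_sketch(const char *tag)
    {
            pthread_mutex_unlock(&namespace_lock_sketch);
            fprintf(stderr, "namespace exit: %s\n", tag);
    }
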
Alexander Motin
aaf374bd40 ZIO: Set minimum number of free issue threads to 32
Free issue threads might block waiting for synchronous DDT, BRT or
gang header reads.  So unlike other taskqs using ZTI_SCALE to scale
with the number of CPUs, here we also need enough threads to
potentially saturate pool reads.  I am not sure we always want the
96 threads we had before ZTI_SCALE was introduced in #11966 on small
systems, but let's make it at least 32.

While here, make the free taskqs configurable, similar to the read
and write ones.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17903
2025-11-12 13:06:39 -08:00
Adi-Goll
015729a11b Fix typo in vdev_raidz.c
Change the spelling of "begining" on line 4875 to
"beginning".

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com>
Closes #17905
2025-11-12 13:06:19 -08:00
Paul Dagnelie
dda711dbb5 Fix gang write late_arrival bug
When a write comes in via dmu_sync_late_arrival, its txg is equal to the
open TXG. If that write gangs, and we have not yet activated the new
gang header feature, and the gang header we pick can store a larger gang
header, we will try to schedule the upgrade for the open TXG + 1. In
debug mode, this causes an assertion to trip. This PR sets the TXG for
activating the feature to be the larger of either the current open TXG
or the syncing TXG + 1.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17824
2025-11-12 13:05:54 -08:00
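
The fix described above amounts to clamping the feature-activation TXG so it can never be scheduled behind the TXG that is already syncing. A simplified illustration with hypothetical variable names.

    #include <stdint.h>

    #define MAX_SKETCH(a, b)        ((a) > (b) ? (a) : (b))

    /*
     * The activation TXG must be no earlier than the open TXG the
     * late-arrival write belongs to, and must come after the TXG
     * that is currently syncing.
     */
    static uint64_t
    feature_activation_txg(uint64_t open_txg, uint64_t syncing_txg)
    {
            return (MAX_SKETCH(open_txg, syncing_txg + 1));
    }
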
Robert Evans
5582e8b08e Update dnode_next_offset_level to accept blkid instead of offset
Currently this function uses L0 offsets, which:
1. is hard to read, since it maps offsets to blkids and back on each call
2. necessitates dnode_next_block to handle edge cases at the limits
3. makes it hard to tell whether the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index).
This way the blkid manipulations are clear, and it's also clear that
the traversal always terminates, since blkid only moves in one direction.

I've also considered updating dnode_next_offset to operate on blkid.
Callers use both patterns, so maybe another PR can split the cases?

While here, tidy up the dnode_next_offset_level comments.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #17792
2025-11-12 13:05:40 -08:00
Alexander Motin
67fc49433f Cleanup ZIO_FLAG_IO_RETRY vs TRYHARD usage
In cases where all issued ZIOs must succeed, and we can't do
anything clever about the errors, we should just explicitly set
ZIO_FLAG_TRYHARD and let the OS do all the reasonable retries.

In other cases, where a retry can be different from the original,
for example when some ZIOs are allowed to fail due to redundancy, or
when we can disable aggregation on retry to get at least some of
the data, we can do a first pass without TRYHARD, and only if needed
retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics).

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17877
2025-11-12 13:05:31 -08:00
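
The policy described above is a two-pass pattern: a cheap first attempt for I/O that redundancy may cover, then a retry with "try hard" semantics only when it still matters. A sketch with invented flag names standing in for ZIO_FLAG_TRYHARD and ZIO_FLAG_IO_RETRY.

    #include <stdbool.h>
    #include <stdint.h>

    #define IO_FLAG_TRYHARD (1u << 0)
    #define IO_FLAG_RETRY   (1u << 1)       /* implies TRYHARD semantics */

    /* Stand-in for actually issuing the I/O; returns 0 or an errno. */
    extern int issue_io(uint64_t blkid, uint32_t flags);

    static int
    read_with_retry(uint64_t blkid, bool must_succeed)
    {
            if (must_succeed) {
                    /* Nothing clever to fall back on: try hard up front. */
                    return (issue_io(blkid, IO_FLAG_TRYHARD));
            }

            /* Cheap first pass; redundancy may tolerate a failure. */
            int err = issue_io(blkid, 0);
            if (err != 0) {
                    /* Retry differently (e.g. no aggregation), trying hard. */
                    err = issue_io(blkid, IO_FLAG_RETRY);
            }
            return (err);
    }
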
Alexander Motin
e3acd0a728 Fix caching of DDT log and BRT
Both the DDT log and BRT counters are read on pool import and then
only appended to or overwritten in full blocks.  We don't need them
in the DMU or ARC caches.  Fortunately we now have DMU_UNCACHEDIO
for this.

Even more, we don't need the BRT in the non-evictable metadata DMU
cache, since it will likely never fit there while blocking the cache
from its original users.  Since DMU_OT_IS_METADATA_CACHED() has no
way to differentiate the new metadata types, mark the BRT with a
storage type of DMU_OT_DDT_ZAP.  As a side effect it will also be
placed on a dedup device, but that should actually be right.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17875
2025-11-12 13:05:25 -08:00
Alexander Motin
178a8be216 BRT: Round bv_entcount up to BRT_BLOCKSIZE
Since we set the bv_mos_brtvdev block size, and since we keep the
dirty bitmap at the same granularity, we should keep the allocations
and writes at that granularity too.  Otherwise the last block write
ends up short, which would be odd once we implement writing of only
dirty blocks, and also requires a read-modify-write at the DMU layer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17875
2025-11-12 13:05:21 -08:00
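
The change is essentially "round the entcount array's allocation and write length up to a whole number of BRT blocks so the last write is never short". A small worked sketch of that rounding; the block-size constant is illustrative.

    #include <stdint.h>

    #define BRT_BLOCKSIZE_SKETCH    4096    /* illustrative block size */

    /* Round a byte count up to a whole number of BRT blocks. */
    static uint64_t
    brt_roundup_sketch(uint64_t nbytes)
    {
            return ((nbytes + BRT_BLOCKSIZE_SKETCH - 1) /
                BRT_BLOCKSIZE_SKETCH * BRT_BLOCKSIZE_SKETCH);
    }

    /*
     * Example: 10000 two-byte entcount entries take 20000 bytes, which
     * rounds up to 20480 bytes (five full 4 KiB blocks), so the final
     * block can be written in full without a read-modify-write.
     */
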
Alexander Motin
5847626175 Pass flags to more DMU write/hold functions
Over time many DMU functions have gained a flags argument to control
prefetch, caching, etc.  A few functions, though, were left without
it, even though a closer look showed that many of them do not require
prefetch due to their access pattern.  This patch adds the flags
argument to dmu_write(), dmu_buf_hold_array() and
dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where
applicable.

I am going to also pass DMU_UNCACHEDIO to some of them later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17872
2025-11-12 13:04:58 -08:00
Andrew Walker
799bda73e2 Fix return value for setting zvol threading
We must return -1 instead of ENOENT if the special zvol threading
property set function can't locate the dataset (this would typically
happen with an encrypted and unmounted zvol) so that the operation
gets inserted properly into the nvlist of operations to set.  This
is because we want the property to be set once the zvol is
decrypted again.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #17836
2025-10-21 09:50:43 -07:00
Brian Behlendorf
7987d4deb4 Update device removal documentation
Make a minor update to the 'zpool remove' man page to clarify that
both raidz and draid pools do not support removal, and change
"sector" to "ashift", which is what we actually care about.

Update the big theory comment in vdev_removal.c to accurately reflect
which types of vdevs can be removed.  Furthermore, I've added some
discussion for the casual reader to briefly explain the top-level
vdev removal restrictions.  This has been a common area of confusion
and it's not intuitive where they come from without understanding
the implementation details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17847
2025-10-21 09:50:43 -07:00
Shreshth3
f16fa115d1 arc: fix small typos
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com>
Closes #17840
2025-10-21 09:50:43 -07:00
Mark Johnston
c1f55bff8b Fix the type of the raidz_outlier_check_interval_ms parameter
It's an hrtime_t, which is an unsigned long long.  In practice this is
just a U64.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #17833
2025-10-21 09:50:43 -07:00
Mateusz Guzik
6c73fd8eeb Annotate arc_buf_is_shared as __maybe_unused
Otherwise the compiler warns about it on production FreeBSD builds.

The routine proved resilient to attempts to ifdef on debug.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #17818
2025-10-21 09:50:43 -07:00
Igor Ostapenko
b9d1e28a71 ddt prune: Add SCL_ZIO deadlock workaround
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17793
2025-10-21 09:50:43 -07:00
Igor Ostapenko
01180a63bd spa_config: Rename spa_config_enter_mmp() to spa_config_enter_priority()
Originally this was created for MMP, but now new cases are emerging
where the same mechanism is required. Hence the name's generalization.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17793
2025-10-21 09:50:43 -07:00
Robert Evans
ead0fb736d zinject: Introduce ready delay fault injection
This adds a pause to the ZIO pipeline in the ready stage for
matching I/O (data, dnode, or raw bookmark).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #17787
2025-10-21 09:50:43 -07:00
hoshinomori
f3295ec763 range_tree: drop duplicate zfs_ prefix from rs_set_fill_raw
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: hoshinomori <hoshinomori@owarisekai.moe>
Closes #17800
2025-09-29 16:50:53 -07:00
Robert Evans
460858dfd6 dnode_next_offset: backtrack if lower level does not match
This changes the basic search algorithm from a single search up and down
the tree to a full depth-first traversal to handle conditions where the
tree matches at a higher level but not a lower level.

Normally higher level blocks always point to matching blocks, but there
are cases where this does not happen:

1. Racing block pointer updates from dbuf_write_ready.

   Before f664f1ee7f (#8946), both dbuf_write_ready and
   dnode_next_offset held dn_struct_rwlock which protected against
   pointer writes from concurrent syncs.

   This no longer applies, so sync context can, e.g., clear or fill all
   L1->L0 BPs before the L2->L1 BP and higher BPs are updated.

   dnode_free_range in particular can reach this case and skip over L1
   blocks that need to be dirtied. Later, sync will panic in
   free_children when trying to clear a non-dirty indirect block.

   This case was found with ztest.

2. txg > 0, non-hole case. This is #11196.

   Freeing blocks/dnodes breaks the assumption that a match at a higher
   level implies a match at a lower level when filtering txg > 0.

   Whenever some but not all L0 blocks are freed, the parent L1 block is
   rewritten. Its updated L2->L1 BP reflects a newer birth txg.

   Later when searching by txg, if the L1 block matches since the txg is
   newer, it is possible that none of the remaining L1->L0 BPs match if
   none have been updated.

   The same behavior is possible with dnode search at L0.

   This is reachable from dsl_destroy_head for synchronous freeing.
   When this happens open context fails to free objects leaving sync
   context stuck freeing potentially many objects.

   This is also reachable from traverse_pool for extreme rewind where it
   is theoretically possible that datasets not dirtied after txg are
   skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely,
as an ESRCH result implies no more matches in the entire object.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #16025
Closes #11196
2025-09-25 12:08:17 -07:00
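
The structural change is from "descend once and give up if a lower level does not match" to a depth-first walk that can climb back up and continue past an exhausted subtree. A schematic sketch of that control flow; it is not the actual dnode traversal code, and the per-level scan is left as an external hook.

    #include <stdbool.h>
    #include <stdint.h>

    #define EPB_SHIFT       7   /* log2(block pointers per indirect block) */

    /*
     * Scan the block at 'level' covering L0 block id *blkidp, starting at
     * the entry that covers *blkidp.  On a match, narrow *blkidp to the
     * first L0 block id covered by the matching entry and return true.
     */
    extern bool scan_level(int level, uint64_t *blkidp);

    /*
     * Depth-first search with backtracking: when a higher-level block
     * pointer claims a match but nothing below it does, climb back up
     * and continue with the next entry instead of reporting ESRCH for
     * the whole object.
     */
    static bool
    dfs_search(int maxlevel, uint64_t *blkidp)
    {
            int level = maxlevel;

            while (level <= maxlevel) {
                    if (scan_level(level, blkidp)) {
                            if (level == 0)
                                    return (true);  /* matching L0 block */
                            level--;        /* descend into the match */
                    } else {
                            /* Skip everything this block covers ... */
                            uint64_t span = (uint64_t)1 << (EPB_SHIFT * level);
                            *blkidp = (*blkidp / span + 1) * span;
                            level++;        /* ... and resume above it. */
                    }
            }
            return (false);     /* exhausted the object */
    }

The search terminates because *blkidp only ever moves forward on failure, and the level can only descend a bounded number of times between advances.
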
Brian Behlendorf
954fe5e1be Add spa_get_worst_case_min_alloc() function
Provide an interface to retrieve the lowest and highest minimum
allocation size for the normal allocation class.  This can be used
by external consumers of the DMU to estimate potential wasted
capacity when setting the recordsize for an object.

The new "min_alloc" and "max_alloc" keys are added to the pool
configuration and used by default_volblocksize() to warn when
an ineffecient block size is requested.  For older kmods which
don't yet include the new keys fallback to the previous logic.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17758
2025-09-25 12:08:14 -07:00
Alexander Motin
efdb4bf07a Fix two infinite loops if dmu_prefetch_max set to zero
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17692
Closes #17729
2025-09-15 12:43:39 -07:00
Paul Dagnelie
cac483dbd4 Fix time database update calculations
The time database update math assumed that the timestamps were in
nanoseconds, but at some point in the development or review process they
changed to seconds. This PR fixes the math to use seconds instead.
    
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17735
2025-09-15 12:43:34 -07:00
Alexander Motin
41c6eaac8b Fix type in dbrrd_closest()
For ABS() to work, the argument must be signed, but rrdd_time is
uint64_t.  Clang noticed it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Fixes #16853
Closes #17733
2025-09-12 15:05:22 -07:00
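
This is the classic pitfall of applying an absolute-value macro to an unsigned subtraction, which can never be negative and instead wraps to a huge value. A minimal illustration of the fix: cast the difference to a signed type first.

    #include <stdint.h>
    #include <stdlib.h>

    /* Distance between two unsigned timestamps (e.g. rrdd_time values). */
    static uint64_t
    time_distance(uint64_t a, uint64_t b)
    {
            /*
             * (a - b) is unsigned, so ABS()-style macros see a large
             * positive number when a < b.  Casting to int64_t first
             * recovers the real signed difference.
             */
            return ((uint64_t)llabs((long long)(int64_t)(a - b)));
    }
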
Chunwei Chen
95d677efde Fix ddle memleak in ddt_log_load
In ddt_log_load(), when removing a duplicate entry from the flushing
tree, we don't free the entry, causing a memory leak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17657
Closes #17730
2025-09-12 15:05:10 -07:00
Allan Jude
6c4ede4026 ZFS allow send:encrypted
A new `zfs allow` permission that ONLY allows sending replication
streams in raw (encrypted) mode, so encrypted data will not be
decrypted as part of the replication process.

Sponsored-by: Klara, Inc.
Sponsored-by: Karakun AG
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Co-authored-by: JT Pennington <jt.pennington@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #17543
2025-09-12 15:05:02 -07:00
Brian Behlendorf
3dc345851c Prevent scrubbing a read-only pool
While it would be nice to be able to scrub a pool imported read-only,
this will currently trip an ASSERT.  Before we can support this there
are some design challenges which need to be thought through first.

For starters, a read-only import skips reading certain information
from disk which it knows won't be needed, such as the space maps.
Furthermore, the scrub process expects to checkpoint its progress,
update the on-disk error log, and issue repair I/O, none of which
is possible when the pool is imported read-only.

Each of these wrinkles can certainly be handled, but that will take
some significant work.  In the meantime we disable the 'zpool scrub'
command when the pool is imported read-only.

Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #17527
Closes #17717
2025-09-11 15:58:52 -07:00
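
The user-visible effect is an early, explicit error from 'zpool scrub' instead of a later assertion failure. Conceptually the guard looks like the sketch below; the names are invented and the real check sits in the scrub start path.

    #include <errno.h>
    #include <stdbool.h>

    struct pool_sketch {
            bool    p_readonly;     /* pool was imported read-only */
    };

    static int
    scrub_start_sketch(struct pool_sketch *p)
    {
            if (p->p_readonly)
                    return (EROFS); /* refuse before any state is touched */
            /* ... checkpoint progress, update error log, issue repair I/O ... */
            return (0);
    }
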
Paul Dagnelie
df55ba7c49 Detect a slow raidz child during reads
A single slow-responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, have it sit out
during reads for a period of time.  The raidz group can use
parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit-out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent of
class `ereport.fs.zfs.delay` is posted.

The length of the sit-out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17227
2025-09-10 15:31:30 -07:00
Paul Dagnelie
0df85ec27c Remove RAIDZ reconstruct flags from debug defaults
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17227
2025-09-10 15:31:25 -07:00
Paul Dagnelie
e2e708241a Enable zhack to work properly with 4k sector size disks
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17576
2025-09-10 15:01:32 -07:00
Chunwei Chen
c755aa486d Fix wrong dedup_table_size for legacy dedup
If we call ddt_log_load() for a legacy DDT, we will end up going into
ddt_log_update_stats() and filling an uninitialized value into
ddo_dspace.  This value will then get added to dedup_table_size during
ddt_get_dedup_object_stats().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17019
Closes #17699

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
2025-09-09 17:06:24 -07:00
Rob Norris
56e8ab4a3e zvol: reject suspend attempts when zvol is shutting down
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17690
2025-09-09 17:04:32 -07:00
youzhongyang
774a34f3ff Synchronize the update of feature refcount
The concurrent execution of feature_sync() can lead to a panic due 
to an unprotected update of the feature refcount.  Resolve this by
using the spa->spa_feat_stats_lock to synchronize the update of the 
refcount.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #17184
Closes #17632
2025-09-09 17:03:27 -07:00
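
The fix is the standard "shared counter updated from concurrent sync tasks needs a lock" pattern. A userspace sketch of that synchronization; only the lock's role corresponds to spa_feat_stats_lock, the rest is invented.

    #include <pthread.h>
    #include <stdint.h>

    struct feature_stats_sketch {
            pthread_mutex_t fs_lock;        /* role of spa_feat_stats_lock */
            uint64_t        fs_refcount;
    };

    /* Concurrent feature_sync()-style callers serialize the update here. */
    static void
    feature_refcount_update(struct feature_stats_sketch *fs, uint64_t refcount)
    {
            pthread_mutex_lock(&fs->fs_lock);
            fs->fs_refcount = refcount;
            pthread_mutex_unlock(&fs->fs_lock);
    }
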
Rob Norris
574eec2964 dnode: remove dn_dirtyctx and dnode_dirtycontext
Only used for a couple of debug assertions which had very little value.

Setting it required taking certain locks, so we can remove all that too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:38 -07:00
Rob Norris
aa6f0f878b dnode: remove dn_dirtyctx_firstset
Old debug param, not used for anything.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:36 -07:00
Rob Norris
eecff1b4a9 dnode: remove dn_dirty_txg and DNODE_IS_DIRTY
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only
existed to ensure that a dnode was clean before making it eligible for
removal from the array of cached dnodes attached to the object 0 L0
dbuf.

dn_dirtycnt is enough to check that now, so use it directly and remove
the rest.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:35 -07:00
Rob Norris
f3e49b0cf5 dnode_is_dirty: reimplement in terms of dn_dirtycnt
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:33 -07:00
Rob Norris
3abf72b251 dnode: add dn_dirtycnt, count of number of txgs this dnode is dirty on
Bumped when we take the dirty hold in dnode_setdirty(), dropped when the
dnode is finally cleaned up after sync in dnode_rele_task() or
userquota_updates_task().

This gives us a way to check if the dnode is dirty on any txg without
having to rely on outside information (eg presence on a dirty list),
which has been a rich source of bugs in the past.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Suggested-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:29 -07:00
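
The counter behaves like a small dirty reference count: bump it when the dnode is dirtied in a txg, drop it when that txg's cleanup runs, and "is this dnode dirty anywhere?" becomes a non-zero test instead of a scan of per-txg dirty lists. A schematic sketch; the field name follows the commit, everything else is invented (the real counter is protected by the dnode's own locking rather than C11 atomics).

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct dnode_sketch {
            _Atomic uint8_t dn_dirtycnt;    /* txgs this dnode is dirty in */
    };

    static void
    dnode_set_dirty(struct dnode_sketch *dn)        /* dnode_setdirty() analogue */
    {
            atomic_fetch_add(&dn->dn_dirtycnt, 1);
    }

    static void
    dnode_clean_one_txg(struct dnode_sketch *dn)    /* post-sync cleanup analogue */
    {
            atomic_fetch_sub(&dn->dn_dirtycnt, 1);
    }

    static bool
    dnode_is_dirty(struct dnode_sketch *dn)
    {
            return (atomic_load(&dn->dn_dirtycnt) != 0);
    }
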
Rob Norris
dcd73069f0 zvol_remove_minors_impl: remove all async fallbacks
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:47 -07:00
Rob Norris
8a0e5e8b54 zvol: stop using zvol_state_lock to protect OS-side private data
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while
holding zvol_state_lock. This then blocks out all other OS-side
operations which want to get the private data, and any ZFS-side control
operations that would take the write half of the lock. It's even worse
if ZFS-side operations induce OS-side calls back into the zvol (eg
creating a zvol triggers a partition probe inside the kernel, and also a
userspace access from udev to set up device links). And it gets even
worse again if anything decides to defer those ops to a task and wait on
them, which zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lockout to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:34 -07:00
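
The key point in the last paragraph is that once the private-data pointer is guaranteed to be cleared before the zvol_state_t becomes invalid, OS-side paths only need an atomic read of that pointer rather than the global lock. A userspace sketch of the pattern with invented names.

    #include <stdatomic.h>
    #include <stddef.h>

    struct zvol_state_sketch;       /* opaque ZFS-side state */

    /* OS-side object (gendisk/geom analogue) carrying the private pointer. */
    struct os_volume_sketch {
            _Atomic(struct zvol_state_sketch *) ov_private;
    };

    /* OS-side open/release: an atomic snapshot, no global lock. */
    static struct zvol_state_sketch *
    os_get_private(struct os_volume_sketch *ov)
    {
            return (atomic_load(&ov->ov_private)); /* NULL => being removed */
    }

    /* ZFS-side removal: clear the pointer before tearing the state down. */
    static void
    os_remove_minor(struct os_volume_sketch *ov)
    {
            atomic_store(&ov->ov_private, NULL);
            /* ... wait for remaining users, then free the zvol state ... */
    }
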
Rob Norris
96f9d271ea zvol: remove the OS-side minor before freeing the zvol
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
be have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:21 -07:00
Rob Norris
b2c792778c zvol: generalise zvol_remove_minors_impl() for single zvol case
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate whether we should include
children and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:11 -07:00
Brian Behlendorf
5061f959d1 Retire zfs_autoimport_disable kmod option
Back in 2014 the zfs_autoimport_disable module option was added to
control whether the kmods should load the pool configs from the cache
file on module load.  The default value since that time has been for
the kernel to not process the cache file.

Detecting and importing pools during boot is now controlled outside
of the kmod on both Linux and FreeBSD.  By all accounts this has been
working well and we can remove this dormant code on the kernel side.

The spa_config_load() function has been moved to userspace; it is
now only used by libzpool.  Additionally, the spa_boot_init() hook
which was used by FreeBSD now looks to be unused and was removed.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17618
2025-08-14 14:58:58 -07:00