mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	a65bb7c518	mmp: claim sequence id before final import As part of SPA_LOAD_IMPORT add an additional activity check to detect simultaneous imports from different hosts. This check is only required when the timing is such that there's no activity for the the read-only tryimport check to detect. This extra safety chceck operates as follows: 1. Repeats the following MMP check 10 times: a. Write out an MMP uberblock with the best txg and a random sequence id to all primary pool vdevs. b. Verify a minimum number of good writes such that even if the pool appears degraded on the remote host it will see at least one of the updated MMP uberblocks. c. Wait for the MMP interval this leaves a window for other racing hosts to make similar modifications which can be detected. d. Call vdev_uberblock_load() to determine the best uberblock to use, this should be the MMP uberblock just written. e. Verify the txg and random sequeunce number match the MMP uberblock written in 1a. 2. Restore the original MMP uberblocks. This allows the check to be performed again if the pool fails to import for an unrelated reason. This change also includes some refactoring and minor improvements. - Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that its either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADD which can result from a stale trusted config when doing a verbatim import. - Refactor the code for consistent logging of the multihost activity check using spa_load_note() and console messages indicating when the activity check was trigger and the result. - Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock. - Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to log to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result. - Standardize the mmp messages similarly to make it easier to find all the relevent mmp lines in the debug log. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-11 13:33:19 -08:00
Brian Behlendorf	328a823848	mmp: add spa_load_name() for tryimport Tryimport adds a unique prefix to the pool name to avoid name collisions. This makes it awkward to log user-friendly info during a tryimport. Add a spa_load_name() function which can be used to report the unmodified pool name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-11 13:33:19 -08:00
Brian Behlendorf	8bbd86693e	mmp: move "Starting import" log message Move the "Starting import" log message in to the import block so it's matched with the "Fiinshed importing" debug message. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-11 13:33:19 -08:00
Brian Behlendorf	36c315571c	mmp: further restrict mmp exported pool check For a cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the systems hostid and the pool was cleanly exported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-11 13:33:19 -08:00
Rob Norris	8010a8a3ca	spa_activity_check: narrow scope of MMP vars They aren't used outside these very small blocks, and their initial values are never used at all. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2026-02-11 13:33:19 -08:00
Paul Dagnelie	a1d839eddd	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2026-02-11 13:33:12 -08:00
Alexander Motin	44e6a07bff	ZIO: Set minimum number of free issue threads to 32 Free issue threads might block waiting for synchronous DDT, BRT or GANG header reads. So unlike other taskqs using ZTI_SCALE to scale with number of CPUs, here we also need some amount of threads to potentially saturate pool reads. I am not sure we always want the 96 threads we had before ZTI_SCALE introduction at #11966 on small systems, but lets make it at least 32. While here, make free taskqs configurable, similar to read and write ones. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17903	2025-12-19 19:55:14 -08:00
bspengler-oss	445879656b	Preserve LIFO ordering of kmap ops in abd_raidz_gen_iterate() ZFS typically preserves proper LIFO ordering regarding map/unmap operations that wrap the Linux kernel's kmap interfaces that require such ordering, but one instance in abd_raidz_gen_iterate() did not. Similar issues have been fixed in the Linux kernel in the past, see for instance CVE-2025-39899 for userfaultfd. Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-19 19:55:14 -08:00
youzhongyang	8e7a310860	Synchronize the update of feature refcount The concurrent execution of feature_sync() can lead to a panic due to an unprotected update of the feature refcount. Resolve this by using the spa->spa_feat_stats_lock to synchronize the update of the refcount. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #17184 Closes #17632	2025-11-05 08:53:06 -08:00
Alexander Motin	81ceee0cff	Fix two infinite loops if dmu_prefetch_max set to zero Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17692 Closes #17729	2025-10-22 10:36:30 -07:00
Robert Evans	a60214e792	dnode_next_offset: backtrack if lower level does not match This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level. Normally higher level blocks always point to matching blocks, but there are cases where this does not happen: 1. Racing block pointer updates from dbuf_write_ready. Before `f664f1ee7f` (#8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock which protected against pointer writes from concurrent syncs. This no longer applies, so sync context can f.e. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BP's are updated. dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block. This case was found with ztest. 2. txg > 0, non-hole case. This is #11196. Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0. Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg. Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated. The same behavior is possible with dnode search at L0. This is reachable from dsl_destroy_head for synchronous freeing. When this happens open context fails to free objects leaving sync context stuck freeing potentially many objects. This is also reachable from traverse_pool for extreme rewind where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case. In both of these cases, without backtracking the search ends prematurely as ESRCH result implies no more matches in the entire object. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Robert Evans <evansr@google.com> Closes #16025 Closes #11196	2025-10-21 11:30:30 -07:00
Brian Behlendorf	2b0502c578	Add interface to interface spa_get_worst_case_min_alloc() function Provide an interface to retrieve the lowest and highest minimum allocation size for the normal allocation class. This can be used by external consumers of the DMU to estimate potential wasted capacity when setting the recordsize for an object. The new "min_alloc" and "max_alloc" keys are added to the pool configuration and used by default_volblocksize() to warn when an ineffecient block size is requested. For older kmods which don't yet include the new keys fallback to the previous logic. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17758	2025-10-21 11:02:42 -07:00
Brian Behlendorf	0fe10361ba	Allow vmem_alloc backed multilists Systems with a large number of CPU cores (192+) may trigger the large allocation warning in multilist_create() on Linux. Silence the warning by converting the allocation to vmem_alloc(). On Linux this results in a call to kvalloc() which will alloc vmem for large allocations and kmem for small allocations. On FreeBSD both vmem_alloc and kmem_alloc internally use the same allocator so there is no functional change. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17616	2025-08-12 17:24:30 -07:00
Rob Norris	f72226a75c	Linux: sync: remove async/sync accounting All this machinery is there to try to understand when there an async writeback waiting to complete because the intent log callbacks are still outstanding, and force them with a timely zil_commit(). The next commit fixes this properly, so there's no need for all this extra housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17584	2025-08-12 17:23:39 -07:00
Olivier Certner	5289f6f961	spa: ZIO_TASKQ_ISSUE: Use symbolic priority This allows to change the meaning of priority differences in FreeBSD without requiring code changes in ZFS. This upstreams commit fd141584cf89d7d2 from FreeBSD src. Sponsored-by: The FreeBSD Foundation Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Olivier Certner <olce@FreeBSD.org> Closes #17489	2025-08-12 17:16:00 -07:00
Alexander Motin	22eb2bdce3	Make TX abort after assign safer It is not right, but there are few examples when TX is aborted after being assigned in case of error. To handle it better on production systems add extra cleanup steps. While here, replace couple dmu_tx_abort() in simple cases. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17438	2025-08-07 12:41:40 -04:00
Alexander Motin	809b553940	Introduce zfs rewrite subcommand (#17246 ) This allows to rewrite content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range range locks, it can be done under any other load. Also it does not affect file's modification time or other properties. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-08-07 12:34:28 -04:00
khoang98	c405a7a35c	Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim threads waiting for the dbuf hash lock in the call sequence: dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Kaitlin Hoang <kthoang@amazon.com> Closes #17561	2025-08-07 12:15:14 -04:00
shodanshok	4808641e71	enforce arc_dnode_limit Linux kernel shrinker in the context of null/root memcg does not scan dentry and inode caches added by a task running in non-root memcg. For ZFS this means that dnode cache routinely overflows, evicting valuable meta/data and putting additional memory pressure on the system. This patch restores zfs_prune_aliases as fallback when the kernel shrinker does nothing, enabling zfs to actually free dnodes. Moreover, it (indirectly) calls arc_evict when dnode_size > dnode_limit. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17487 Closes #17542	2025-08-07 12:11:34 -04:00
Alexander Motin	30fa92bff3	Increase meta-dnode redundancy in "some" mode Loss of one indirect block of the meta dnode likely means loss of the whole dataset. It is worse than one file that the man page promises, and in my opinion is not much better than "none" mode. This change restores redundancy of the meta-dnode indirect blocks, while same time still corrects expectations in the man page. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17339	2025-08-05 13:15:44 -04:00
Paul Dagnelie	fd5a27c9db	Ensure that gang_copies is always at least as large as copies As discussed in the comments of PR #17004, you can theoretically run into a case where a gang child has more copies than the gang header, which can lead to some odd accounting behavior (and even trip a VERIFY). While the accounting code could be changed to handle this, it fundamentally doesn't seem to make a lot of sense to allow this to happen. If the data is supposed to have a certain level of reliability, that isn't actually achieved unless the gang_copies property is set to match it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17484	2025-08-05 13:14:45 -04:00
Paul Dagnelie	a46ce73ca8	Make ganging redundancy respect redundant_metadata property (#17073 ) The redundant_metadata setting in ZFS allows users to trade resilience for performance and space savings. This applies to all data and metadata blocks in zfs, with one exception: gang blocks. Gang blocks currently just take the copies property of the IO being ganged and, if it's 1, sets it to 2. This means that we always make at least two copies of a gang header, which is good for resilience. However, if the users care more about performance than resilience, their gang blocks will be even more of a penalty than usual. We add logic to calculate the number of gang headers copies directly, and store it as a separate IO property. This is stored in the IO properties and not calculated when we decide to gang because by that point we may not have easy access to the relevant information about what kind of block is being stored. We also check the redundant_metadata property when doing so, and use that to decide whether to store an extra copy of the gang headers, compared to the underlying blocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-08-05 13:10:40 -04:00
Igor Ostapenko	95abbc71c3	range_tree: Provide more debug details upon unexpected add/remove Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Closes #17581	2025-08-05 12:34:54 -04:00
Tino Reichardt	fc658b9935	Faster checksum benchmark on system boot While booting, only the needed 256KiB benchmarks are done now. The delay for checking all checksums occurs when requested via: - Linux: cat /proc/spl/kstat/zfs/chksum_bench - FreeBSD: sysctl kstat.zfs.misc.chksum_bench Reported by: Lahiru Gunathilake <gunathilakebllg@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Colin Percival <cperciva@tarsnap.com> Closes #17563 Closes #17560	2025-08-05 12:34:13 -04:00
Paul Dagnelie	271b9797c5	Don't use wrong weight when passivating group When we're passivating a metaslab group we start by passivating the metaslabs that have been activated for each of the allocators. To do that, we need to provide a weight. However, currently this erroneously always uses a segment-based weight, even if segment-based weighting is disabled. Use the normal weight function, which will decide which type of weight to use. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17566	2025-08-05 12:33:52 -04:00
Brian Behlendorf	582e7847f6	Default to zfs_bclone_wait_dirty=1 Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to the optimial thing and immediately force a TXG sync. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17455	2025-08-05 12:33:30 -04:00
Alexander Motin	347d68048a	ZIL: Force writing of open LWB on suspend Under parallel workloads ZIL may delay writes of open LWBs that are not full enough. On suspend we do not expect anything new to appear since zil_get_commit_list() will not let it pass, only returning TXG number to wait for. But I suspect that waiting for the TXG commit without having the last LWB issued may not wait for its completion, resulting in panic described in #17509. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17521	2025-08-05 12:28:41 -04:00
Paul Dagnelie	acf3871ef8	Correct weight recalculation of space-based metaslabs Currently, after a failed allocation, the metaslab code recalculates the weight for a metaslab. However, for space-based metaslabs, it uses the maximum free segment size instead of the normal weighting algorithm. This is presumably because the normal metaslab weight is (roughly) intended to estimate the size of the largest free segment, but it doesn't do that reliably at most fragmentation levels. This means that recalculated metaslabs are forced to a weight that isn't really using the same units as the rest of them, resulting in undesirable behaviors. We switch this to use the normal space-weighting function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Closes #17531	2025-08-05 12:28:34 -04:00
Paul Dagnelie	7e945a5b3f	Fix other nonrot bugs There are still a variety of bugs involving the vdev_nonrot property that will cause problems if you try to run the test suite with segment-based weighting disabled, and with other things in the weighting code. Parents' nonrot property need to be updated when children are added. When vdevs are expanded and more metaslabs are added, the weights have to be recalculated (since the number of metaslabs is an input to the lba bias function). When opening, faulted or unopenable children should not be considered for whether a vdev is nonrot or not (since the nonrot property is determined during a successful open, this can cause false negatives). And draid spares need to have the nonrot property set correctly. Sponsored-by: Eshtek, creators of HexOS Sponsored-by: Klara, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17469	2025-08-05 12:25:26 -04:00
Alexander Motin	85ce6b8ab2	Polish db_rwlock scope dbuf_verify(): Don't need the lock, since we only compare pointers. dbuf_findbp(): Don't need the lock, since aside of unneeded assert we only produce the pointer, but don't de-reference it. dnode_next_offset_level(): When working on top level indirection should lock dnode buffer's db_rwlock, since it is our parent. If dnode has no buffer, then it is meta-dnode or one of quotas and we should lock the dataset's ds_bp_rwlock instead. Reviewed-by: Alan Somers <asomers@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17441	2025-08-05 12:23:32 -04:00
Mariusz Zaborski	954894ee53	scrub: generate scrub_finish event The `scn_min_txg` can now be used not only with resilver. Instead of checking `scn_min_txg` to determine whether it’s a resilver or a scrub, simply check which function is defined. Thanks to this change, a scrub_finish event is generated when performing a scrub from the saved txg. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #17432	2025-08-05 12:21:46 -04:00
Alexander Motin	a4e775d2ca	Some arc_release() cleanup - Don't drop L2ARC header if we have more buffers in this header. Since we leave them the header, leave them the L2ARC header also. Honestly we are not required to drop it even if there are no other buffers, but then we'd need to allocate it a separate header, which we might drop soon if the old block is really deleted. Multiple buffers in a header likely mean active snapshots or dedup, so we know that the block in L2ARC will remain valid. It might be rare, but why not? - Remove some impossible assertions and conditions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17126	2025-08-05 12:16:27 -04:00
Paul Dagnelie	661310ff5c	FDT dedup log sync -- remove incremental This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, changes were made to the formula for how much to sync per txg. We now also consider the backlog we have to clear, to prevent it from growing too large, or remaining large on an idle system. Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Authored-by: Don Brady <don.brady@klarasystems.com> Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17038	2025-08-05 12:15:21 -04:00
Alexander Motin	f9d59b579e	ZIL: Relax parallel write ZIOs processing ZIL introduced dependencies between its write ZIOs to permit flush defer, when we flush vdev caches only once all the write ZIOs has completed. But it was recently spotted that it serializes not only ZIO completions handling, but also their ready stage. It means ZIO pipeline can't calculate checksums for the following ZIOs until all the previous are checksumed, even though it is not required. On a systems where memory throughput of a single CPU core is limited, it creates single-core CPU bottleneck, which is difficult to see due to ZIO pipeline design with many taskqueue threads. While it would be great to bypass the ready stage waits, it would require changes to ZIO code, and I haven't found a clean way to do it. But I've noticed that we don't need any dependency between the write ZIOs if the previous one has some waiters, which means it won't defer any flushes and work as a barrier for the earlier ones. Bypassing it won't help large single-thread writes, since all the write ZIOs except the last in that case won't have waiters, and so will be dependent. But in that case the ZIO processing might not be a bottleneck, since there will be only one thread populating the write buffers, that will likely be the bottleneck. But bypassing the ZIO dependency on multi-threaded write workloads really allows them to scale beyond the checksuming throughput of one CPU core. My tests with writing 12 files on a same dataset on a pool with 4 striped NVMes as SLOGs from 12 threads with 1MB blocks on a system with Xeon Silver 4114 CPU show total throughput increase from 4.3GB/s to 8.5GB/s, increasing the SLOGs busy from ~30% to ~70%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17458	2025-08-05 12:14:18 -04:00
Alexander Motin	f7e6dcc68d	Relax zfs_vnops_read_chunk_size limitations It makes no sense to limit read size below the block size, since DMU will any way consume resources for the whole block, while the current zfs_vnops_read_chunk_size is only 1MB, which is smaller that maximum block size of 16MB. Plus in case of misaligned Uncached I/O the buffer may get evicted between the chunks, requiring repeating I/Os. On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB. It allows to less depend on speculative prefetcher if application requests specific size, first not waiting for prefetcher to start and later not prefetching more than needed. Also while there, we don't need to align reads to the chunk size, but only to a block size, which is smaller and so more forgiving. My profiles show ~4% of CPU time saving when reading 16MB blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17415	2025-06-17 10:50:27 -07:00
Rob Norris	e2de00ca44	dmu_traverse: remove 'ignore_hole_birth' tunable alias It's been many years, we can probably do without. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17376	2025-06-17 10:50:27 -07:00
Allan Jude	8e9ffe1b4f	ARC: parallel eviction On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place. This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results. A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs. Sponsored-by: Expensify, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Closes #16486	2025-06-17 10:50:26 -07:00
Don Brady	6b67a5bdd3	During pool export flush the ARC asynchronously This also includes removing L2 vdevs asynchronously. This commit also guarantees that spa_load_guid is unique. The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid() which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! So we guarantee that the value is always unique and additionally not still in use by an async arc flush task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16215	2025-06-17 10:50:26 -07:00
Rob Norris	9c0f5bc183	zfs_log_write: only put the callback on the last itx If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-17 10:50:26 -07:00
Alexander Motin	0c9cdd1606	Improve block cloning transactions accounting Previous dmu_tx_count_clone() was broken, stating that cloning is similar to free. While they might be from some points, cloning is not net-free. It will likely consume space and memory, and unlike free it will do it no matter whether the destination has the blocks or not (usually not, so previous code did nothing). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17431	2025-06-17 10:50:26 -07:00
Alexander Motin	23fad19818	Reduce zfs_dmu_offset_next_sync penalty Looking on txg_wait_synced(, 0) I've noticed that it always syncs 5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE. But in case of dmu_offset_next() we do not care about deferred frees. And even concurrent TXGs we might not need sync all 3 if the dnode was not dirtied in last few TXGs. This patch makes dmu_offset_next() to sync one TXG at a time until the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times. My tests with random simultaneous writes and seeks over many files on HDD pool show 7-14% performance increase. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17434	2025-06-17 10:50:26 -07:00
Alexander Motin	3897e86bd1	Make TX abort after assign safer It is not right, but there are few examples when TX is aborted after being assigned in case of error. To handle it better on production systems add extra cleanup steps. While here, replace couple dmu_tx_abort() in simple cases. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17438	2025-06-17 10:50:26 -07:00
Alexander Motin	4c8d0471fa	Allow zero compression if dedup is enabled Having high-refcount dedup entries for zero blocks is inefficient when they could be recorded as a holes instead. Normally, zero compression is not done if compression is disabled to not confuse naive benchmarks. But with dedup enabled, it is expected that the write will be skipped anyway, so we are just optimizing the way it is skipped. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17435	2025-06-17 10:50:26 -07:00
Alexander Motin	bf4baee81e	Set spa_final_txg in spa_unload() I've noticed that after some dedup tests system reboot ends up in assertion about ms_defer tree not free. It seems to be caused by DDT flushing still freeing some blocks while ZFS trying to reach a final steady state due to spa_final_txg, while being set by spa_export_common() on pool export, is not set when spa_unload() is called by spa_evict_all() on system reboot/shutdown. Setting spa_final_txg in spa_unload() fixes this issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17395	2025-06-17 10:50:26 -07:00
Ameer Hamza	f292b0f146	vdev: skip faulting disks pending removal This patch fixes a race where vdev_remove_wanted may be set after probe initiation, which could otherwise trigger redundant fault and removal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17400	2025-06-17 10:50:26 -07:00
Rob Norris	d7bb6bbf13	tunables: fix spelling Three occurences with an 'e', and all of them mine. Maybe it's an British thing? Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	d02d3add0d	tunables: ensure tunable and variable have same define gate If a variable is only available in the kernel, then the tunable should also only be available there. This matters very little so long as we don't have userspace tunables, but its still good hygeine. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	cc5724f38d	tunables: don't assert initialisation in impl getters It actually doesn't matter if it's not initialised when we first query the current value; it just returns empty-string. A crash is quite obnoxious even if it is a rare case. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	8317244270	zfs_log: make zfs_immediate_write_sz uint Likely it's only int64 for comparison with ssize_t, which is signed. However, it would make no sense for it to be less than 0 or greater than 4G, so making it a regular uint will make it safe for comparison and remove the only S64 tunable in core. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Paul Dagnelie	65cf521353	Only interrupt active disk I/Os in failmode=continue failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable. With this patch, I choose the latter. While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it can at least sometimes be useful for some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17372	2025-06-17 10:49:40 -07:00

1 2 3 4 5 ...

3393 Commits