mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-22 18:40:43 +03:00

Author	SHA1	Message	Date
Ryan Moeller	ac0fd40c8c	Add zpool properties for allocation class space The existing zpool properties accounting pool space (size, allocated, fragmentation, expandsize, free, capacity) are based on the normal metaslab class or are cumulative properties of several classes combined. Add properties reporting the space accounting metrics for each metaslab class individually. Also introduce pool-wide AVAIL, USABLE, and USED properties reporting values corresponding to FREE, SIZE, and ALLOC deflated for raidz. Update ZTS to recognize the new properties and validate reported values. While in zpool_get_parsable.cfg, add "fragmentation" to the list of parsable properties. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com> Closes #18238	2026-03-02 15:50:23 -08:00
Ryan Moeller	6ba3f915d0	zcommon: Fix description of vdev capacity format Capacity is reported as a percentage not a size. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com> Closes #18238	2026-03-02 15:49:23 -08:00
Akash B	f8e5af53e9	Fix redundant declaration of dsl_pool_t Remove redundant dsl_pool variable and duplicate spa_get_dsl() call in vdev_rebuild_thread. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Akash B <akash-b@hpe.com> Closes #18263	2026-02-27 10:39:52 -08:00
Andriy Tkachuk	f8457fbdc4	Fix deadlock on dmu_tx_assign() from vdev_rebuild() vdev_rebuild() is always called with spa_config_lock held in RW_WRITER mode. However, when it tries to call dmu_tx_assign() the latter may hang on dmu_tx_wait() waiting for available txg. But that available txg may not happen because txg_sync takes spa_config_lock in order to process the current txg. So we have a deadlock case here: - dmu_tx_assign() waits for txg holding spa_config_lock; - txg_sync waits for spa_config_lock not progressing with txg. Here are the stacks: __schedule+0x24e/0x590 schedule+0x69/0x110 cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] dmu_tx_wait+0x8e/0x1e0 [zfs] dmu_tx_assign+0x49/0x80 [zfs] vdev_rebuild_initiate+0x39/0xc0 [zfs] vdev_rebuild+0x84/0x90 [zfs] spa_vdev_attach+0x305/0x680 [zfs] zfs_ioc_vdev_attach+0xc7/0xe0 [zfs] cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] spa_config_enter+0xf9/0x120 [zfs] spa_sync+0x6d/0x5b0 [zfs] txg_sync_thread+0x266/0x2f0 [zfs] The solution is to pass txg returned by spa_vdev_enter(spa) at the top of spa_vdev_attach() to vdev_rebuild() and call dmu_tx_create_assigned(txg) which doesn't wait for txg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18210 Closes #18258	2026-02-26 11:18:02 -08:00
Rob Norris	f3d4c79496	zpl_super: prefer "new" mount API when available This API has been available since kernel 5.2, and having it available (almost) everywhere should give us a lot more flexibility for mount management in the future. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18260	2026-02-25 13:17:33 -08:00
Rob Norris	09c27a14a3	icp: add SHA512 implementation using Intel SHA512 extensions Generated from crypto/sha/asm/sha512-x86_64.pl in openssl/openssl@241d4826f8. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18233	2026-02-25 12:48:30 -08:00
Rob Norris	3547a358fd	simd: detect and surface support for Intel SHA512 extensions Recent Intel CPUs (starting with Arrow Lake and Lunar Lake) include new vectorised SHA512 instructions. Detect them and make them available to the rest of the system. Note the internal name "sha512ext". This is to disambiguate from other uses of "sha512". Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18233	2026-02-25 12:47:48 -08:00
clefru	6495dafd58	range_tree: use zfs_panic_recover() for partial-overlap remove zfs_range_tree_remove_impl() used a bare panic() when a segment to be removed was not completely overlapped by an existing tree entry. Every other consistency check in range_tree.c uses zfs_panic_recover(), which respects the zfs_recover tunable and allows pools with on-disk corruption to be imported and recovered. This one call was inconsistent, making the partial-overlap case unrecoverable regardless of zfs_recover. Replace panic() with zfs_panic_recover() so that operators can set zfs_recover=1 to import a corrupted pool and reclaim data, consistent with all other range tree error paths. Related-to: https://github.com/openzfs/zfs/issues/13483 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Closes #18255	2026-02-25 11:26:10 -08:00
Alexander Motin	991fc56fae	Introduce dedupused/dedupsaved pool properties Currently there is only a dedup ratio reported via pool properties. If dedup is enabled only for some datasets, it is impossible to say how much space the ratio actually covers. Fix this by introducing dedupused/dedupsaved pool properties, similar to earlier added block cloning ones. Combined with work to expose allocation classes stats, it should give user-space enough visibility to correlate `zpool list` and `zfs list` space numbers. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18245	2026-02-25 09:41:38 -05:00
Rob Norris	0f608aa6ca	Linux 7.0: add shims for the fs_context-based mount API The traditional mount API has been removed, so detect when it's not available and instead use a small adapter to allow our existing mount functions to keep working. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18216	2026-02-23 09:45:12 -08:00
Rob Norris	204de946eb	Linux 7.0: blk_queue_nonrot() renamed to blk_queue_rot() It does exactly the same thing, just inverts the return. Detect its presence or absence and call the right one. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18216	2026-02-23 09:44:20 -08:00
MigeljanImeri	4975430cf5	Add vdev property to disable vdev scheduler Added vdev property to disable the vdev scheduler. The intention behind this property is to improve IOPS performance when using o_direct. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com> Closes #17358	2026-02-23 09:34:33 -08:00
Tony Hutter	d2f5cb3a50	Move range_tree, btree, highbit64 to common code Break out the range_tree, btree, and highbit64/lowbit64 code from kernel space into shared kernel and userspace code. This is needed for the updated `zpool status -vv` error byte range reporting that will be coming in a future commit. That commit needs the range_tree code in kernel and userspace. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18133	2026-02-22 11:43:51 -08:00
Rob Norris	168023b603	Linux 7.0: explicitly set setlease handler to kernel implementation The upcoming 7.0 kernel will no longer fall back to generic_setlease(), instead returning EINVAL if .setlease is NULL. So, we set it explicitly. To ensure that we catch any future kernel change, adds a sanity test for F_SETLEASE and F_GETLEASE too. Since this is a Linux-specific test, also a small adjustment to the test runner to allow OS-specific helper programs. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18215	2026-02-22 11:39:06 -08:00
Alexander Motin	d06a1d9ac3	Fix available space accounting for special/dedup (#18222) Currently, spa_dspace (base to calculate dataset AVAIL) only includes the normal allocation class capacity, but dd_used_bytes tracks space allocated across all classes. Since we don't want to report free space of other classes as available (we can't promise new allocations will be able to use it), report only allocated space, similar to how we report space saved by dedup and block cloning. Since we need deflated space here, make allocation classes track deflated allocated space also. While here, make mc_deferred also deflated, matching its use contexts. Also while there, use atomic_load() to read the allocation class stats. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18190 Closes #18222	2026-02-19 10:36:35 -08:00
Tony Hutter	640a217faf	CI: Test & fix Linux ZFS built-in build ZFS can be built directly into the Linux kernel. Add a test build of this to the CI to verify it works. The test build is only enabled on Fedora runners (since they run the newest kernels) and is done in parallel with ZTS. The test build is done on vm2, since it typically finishes ~15min before vm1 and thus has time to spare. In addition: - Update 'copy-builtin' to check that $1 is a directory - Fix some VERIFYs that were causing the built-in build to fail Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18234	2026-02-19 10:15:41 -08:00
Attila Fülöp	c8a72a27e5	ICP: AES-GCM assembly: remove unused Gmul functions In the AES-GCM assembly files we are defining Gmul functions we don't use anywhere. Just remove the dead code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #18226	2026-02-19 10:10:02 -08:00
Alexander Motin	370570890f	Remove parent ZIO from dbuf_prefetch() I am not sure why it was added there 10 years ago, but it seems not needed now. According to my tests removing it improves sequential read performance with recordsize=4K by 5-10% by reducing the CPU overhead in prefetcher. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18214	2026-02-18 18:12:13 -08:00
Attila Fülöp	d489677280	ICP: AES-GCM VAES-AVX2: fix typos and document source files Require AVX2 compiler support and document source files for `aesni-gcm-avx2-vaes.S`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #18225	2026-02-17 16:51:32 -08:00
Attila Fülöp	bee53d8c10	Linux 6.19 compat: in-tree build: fix duplicate GCM assembly functions Linux 6.19 added an AES-GCM VAES-AVX2 assembly implementation. It's basically a translation from the BoringSSL perlasm syntax to macro assembly. We're using the same source but the perlasm generated flat assembly which shares some global function names with the former. When building in-tree this results in the linker failing due to the duplicate symbols. To avoid the error we prepend `icp_` via a macro to our function names. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Moch <mail@alexmoch.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #18204 Closes #18224	2026-02-17 13:09:41 -08:00
Alexander Motin	0f9564e85b	Simplify dnode_level_is_l2cacheable() We should not dereference through dn_handle->dnh_dnode once we already have a dnode pointer. The result will be the same. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18212	2026-02-16 10:34:22 -05:00
Alexander Motin	ba970eb202	Cleanup allocation class selection - For multilevel gang blocks it seemed possible to fallback from normal to special class, since they don't have proper object type, and DMU_OT_NONE is a "metadata". They should never fallback. - Fix possible inversion with zfs_user_indirect_is_special = 0, when indirects written to normal vdev, while small data to special. Make small indirect blocks also follow special_small_blocks there. - With special_small_blocks now applying to both files and ZVOLs, make it apply to all non-metadata without extra checks, since there are no other non-metadata types. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18208	2026-02-16 10:33:21 -05:00
Mariusz Zaborski	cdf89f413c	Flush RRD only when TXGs contain data This change modifies the behavior of spa_sync_time_logger when flushing the RRD database. Previously, once the sync interval elapsed, a flush would always be generated. On solid-state devices, especially when the pool was otherwise idle, this caused disks to wake up solely to write RRD data. Since RRD is best-effort telemetry, this behavior is unnecessary and wasteful. With this change, spa_sync_time_logger delays flushing until a TXG that already contains data is being synced. The RRD update is appended to that TXG instead of forcing the creation of a new write-only TXG. During pool export, flushing is forced regardless of whether the TXG contains user data. At that stage, data durability takes precedence and a write must be issued. Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #18082 Closes #18138	2026-02-11 11:35:45 -08:00
Marc Sladek	cc184fe98b	Fix `send:raw` permission for send `-w -I` When performing an incremental raw send with intermediates (-w -I), the standard 'send' permission was incorrectly required instead of allowing 'send:raw'. This was due to a strict boolean comparison on the 'rawok' flag in zfs_secpolicy_send() with non-boolean value. This change normalizes the 'rawok' variable to be strictly 0/1 and updates the test suite to properly verify delegated raw send behavior. Introduced-by: https://github.com/openzfs/zfs/pull/17543 Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Marc Sladek <marc@sladek.dev> Closes #18198 Closes #18193	2026-02-11 10:30:26 -08:00
Alexander Motin	aa29455dd7	Restrict cloning with different properties While technically it's not a problem to clone between datasets with different properties, it might create an expectation of new properties being applied during data move, while actually it won't happen. For copies and checksum it may mean incorrect safety expectations. For dedup, compression and special_small_blocks -- performance and space usage. New zfs_bclone_strict_properties tunable controls it. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18180	2026-02-10 09:53:24 -08:00
rmacklem	1412bdc6c2	zfs_vnops_os.c: Move a vput() to after zfs_setattr_dir() Without this patch, the following crash can occur when a file system is configured with "xattr=dir". VNASSERT failed: locked not true at /posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked) hold count flags () flags () lock type zfs: UNLOCKED panic: zfs_dirent_lookup: vnode is not locked but should be cpuid = 3 time = 1770520763 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b vpanic() at vpanic+0x136/frame 0xfffffe00914c8270 panic() at panic+0x43/frame 0xfffffe00914c82d0 assert_vop_locked() at assert_vop_locked+0x78 zfs_dirent_lookup() at zfs_dirent_lookup+0x41 zfs_setattr_dir() at zfs_setattr_dir+0x123 zfs_setattr() at zfs_setattr+0x1389 zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d setfown() at setfown+0xb1 kern_fchownat() at kern_fchownat+0x192 This patch fixes the problem by moving the vput() call for attrzp to after the zfs_setattr_dir() call that takes it as an argument. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes: #18188	2026-02-10 09:29:37 -05:00
Alexander Motin	2646bd5585	Allow rewrite skip cloned and snapshotted blocks Rewrite of cloned and snapshotted blocks can allocate additional space, which may be undesired. In some cases it may make sense to still rewrite snapshotted blocks, expecting the snapshots to rotate with time, freeing space. In other cases rewrite of cloned blocks may be acceptable, despite persistent space usage increase. For this reason add them as separate flags to `zfs rewrite`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18179	2026-02-09 10:17:56 -08:00
Brian Behlendorf	ae488e496f	ZTS: update the relevant mmp test cases - mmp_concurrent_import: added test case to verify concurrent import correctness. The pool may only be imported once. - mmp_exported_import: an activity check is now required for pools which were cleanly exported if the system and pool hostids don't match. - mmp_inactive_import: an activity check is now required for any pool which wasn't cleanly exported, even if the system and pool hostids match. - mmp_on_uberblocks: updated expected uberblocks to take into account the value MMP_INTERVAL_DEFAULT is set to. - mmp_reset_interval: reduce the number of iterations from 10 to 3. This is sufficient to verify functionality and significantly speeds up the test. - mmp_on_uberblocks: adjust the thresholds and increase the runtime to avoid false positives observed in CI. - Update tests to use 'zhack action idle' instead of ztest to improve the reliability of the tests. - Add additional log_note messages to test cases which have multiple verification steps to make it clear which portion of a test failed when reviewing the logs. - Replace default_setup/cleanup_noexit calls with 'zpool create' and 'zpool destroy' calls to avoid additional unnecessary dataset creation work. - Update activity/noactivity check helper functions to use the ZFS_LOAD_INFO_DEBUG information now available from 'zpool import' to determine if this activity check ran and why. This is more reliable in the CI than measuring the runtime. - Removed all mmp tests from the zts-report.py exceptions list. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-09 09:36:18 -08:00
Brian Behlendorf	20176224ee	mmp: claim sequence id before final import As part of SPA_LOAD_IMPORT add an additional activity check to detect simultaneous imports from different hosts. This check is only required when the timing is such that there's no activity for the read-only tryimport check to detect. This extra safety check operates as follows: 1. Repeats the following MMP check 10 times: a. Write out an MMP uberblock with the best txg and a random sequence id to all primary pool vdevs. b. Verify a minimum number of good writes such that even if the pool appears degraded on the remote host it will see at least one of the updated MMP uberblocks. c. Wait for the MMP interval. This leaves a window for other racing hosts to make similar modifications which can be detected. d. Call vdev_uberblock_load() to determine the best uberblock to use, this should be the MMP uberblock just written. e. Verify the txg and random sequence number match the MMP uberblock written in 1a. 2. Restore the original MMP uberblocks. This allows the check to be performed again if the pool fails to import for an unrelated reason. This change also includes some refactoring and minor improvements. - Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that it's either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADD which can result from a stale trusted config when doing a verbatim import. - Refactor the code for consistent logging of the multihost activity check using spa_load_note() and console messages indicating when the activity check was triggered and the result. - Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock. - Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result. - Standardize the mmp messages similarly to make it easier to find all the relevant mmp lines in the debug log. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-09 09:36:01 -08:00
Brian Behlendorf	2f048ced4d	mmp: add spa_load_name() for tryimport Tryimport adds a unique prefix to the pool name to avoid name collisions. This makes it awkward to log user-friendly info during a tryimport. Add a spa_load_name() function which can be used to report the unmodified pool name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-09 09:35:03 -08:00
Brian Behlendorf	62a1bf7d19	mmp: move "Starting import" log message Move the "Starting import" log message in to the import block so it's matched with the "Finished importing" debug message. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-09 09:34:57 -08:00
Brian Behlendorf	a9564b1787	mmp: further restrict mmp exported pool check For cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the system's hostid and the pool was cleanly exported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-09 09:32:58 -08:00
Austin Wise	4f180e095a	Fix activating large_microzap on receive This ensures that the in-memory state of the feature is recorded and that `dsl_dataset_activate_feature` is not called when the feature is already active. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18143 Closes #18144	2026-02-05 15:48:03 -08:00
Alexander Motin	21bbe7cb67	Improve caching for dbuf prefetches To avoid read errors with transaction open dmu_tx_check_ioerr() is used to read everything required in advance. But there seems to be a chance for the buffer to be evicted from dbuf cache in between, which results in immediate eviction from ARC, which may require additional disk read later in a place where error handling is problematic. To partially work around this introduce a new flag DMU_IS_PREFETCH, relayed to ARC as ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH, making ARC delay eviction by at least several seconds, or till the actual read inside the transaction, that will promote it to demand access. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18160	2026-02-04 10:12:32 -08:00
Ameer Hamza	00d69b0f72	arc: remove unused l2df_size and l2df_type from l2arc_data_free_t These fields became unused when ABD was introduced in `a6255b7fc`. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:26 -08:00
Ameer Hamza	d1f290f1ea	L2ARC: Implement DWPD-based rate limiting with adaptive feed intervals Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write speeds and protect SSD endurance. Write rate is constrained by the minimum of l2arc_write_max and DWPD-calculated budget. Devices accumulate unused write budget over 24-hour periods with automatic reset and carry-over. Writes occur in controlled bursts (max 50MB) with adaptive intervals to achieve target rates. Applies after initial device fill. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:07 -08:00
Ameer Hamza	b525525b44	L2ARC: Implement per-device feed threads for parallel writes Transform L2ARC from single global feed thread to per-device threads, enabling parallel writes to multiple L2ARC devices. Each device runs its own feed thread independently, improving multi-device throughput. Previously, a single thread served all devices sequentially; now each device writes concurrently. Threads are created during device addition and torn down on removal. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:02 -08:00
Ameer Hamza	825dc41ad4	L2ARC: Preserve L2HDR in arc_release() for in-flight writes When arc_release() is called on a header with a single buffer and L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar to the arc_hdr_destroy() case). If we destroy the L2HDR here, later arc_write() will allocate a new ABD and call arc_hdr_free_abd(), which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing VERIFY(HDR_HAS_L2HDR(hdr)) to fail. Allocate a new header for the buffer in the single_buf_l2writing case (single buffer + L2_WRITING), leaving the original header with L2HDR intact. The original header becomes an "orphan" (no buffers, no b_pabd) but retains device association for ABD cleanup when l2arc_write_done() completes. The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC makes its own transformed copy via l2arc_apply_transforms(), so the original ABD is not used by the L2 write. The header can be safely reused without allocating a new one. For proper evictable space accounting, arc_buf_remove() must be called before remove_reference() in the single_buf_l2writing path. This ensures arc_evictable_space_increment() (during remove_reference) and arc_evictable_space_decrement() (during destruction) see the same state (b_buf=NULL), preventing accounting leaks that cause module unload to hang with non-zero esize. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:57 -08:00
Ameer Hamza	b8610c3d93	L2ARC: Reorder header destruction for in-flight L2 writes With multiple L2ARC devices, headers can be destroyed asynchronously (e.g., during zpool sync) while L2_WRITING is set. The original code destroyed L2HDR before L1HDR, causing ABDs to lose their device association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called. This caused ABDs to be added to the global free-on-write list without device information. When any L2ARC device completed its write and attempted to free these orphaned ABDs, it would panic on ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was still part of another device's vdev_queue I/O aggregation gang. Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag ABDs with their device for deferred cleanup. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:51 -08:00
Ameer Hamza	2f41b9d865	L2ARC: Implement persistent markers with consistent tail scanning This commit introduces per-sublist persistent markers that eliminate redundant tail scanning between L2ARC iterations, providing significant CPU efficiency improvements. Markers are pre-allocated during device initialization and properly cleaned up during device removal. The implementation uses conditional behavior based on device capacity: small devices (capacity < arc_c) retain original HEAD/TAIL scanning based on ARC warmup state, while large devices (capacity >= arc_c) use the persistent marker approach for optimal CPU efficiency. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:47 -08:00
Ameer Hamza	3523b5f3f9	L2ARC: Implement even-depth multi-sublist scanning The introduction of ARC multilists made L2ARC writing quite random, depending on whether it found something to write in a randomly selected sublist. This created inconsistent write patterns and poor utilization of available sublists leading to uneven cache population. This commit replaces random selection with systematic scanning across all sublists within each burst. Fair headroom distribution ensures even-depth traversal across all sublists until the target write size is reached. Round-robin processing with random starting points eliminates sequential bias while maintaining predictable write behavior. The systematic approach provides consistent L2ARC filling patterns and better utilization of available ARC data across all sublists. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:05:53 -08:00
Brooks Davis	b364720524	nvpair: chase FreeBSD xdrproc_t definition As of FreeBSD 16, xdrproc_t will take exactly two arguments in both kernel and userspace in line with the Linux kernel. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alan Somers <asomers@freebsd.org> Signed-off-by: Brooks Davis <brooks@capabilitieslimited.co.uk> Closes #18154	2026-01-28 21:41:33 -05:00
Mariusz Zaborski	a157ef62a1	Make sure we can still write data to txg The final txgs are used only to clear out any remaining deferred frees, and we cannot write new data to them. Make sure we do not try to do so. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #18139	2026-01-26 21:33:21 -05:00
Alexander Motin	35b2d39709	Lock db_mtx around arc_release() in couple places * Lock db_mtx around arc_release() in dbuf_release_bp() While this function is called only in sync context, the same buffer can be touched by dbuf_hold_impl() in open context, creating races. All other accesses to arc_release() are already protected by db_mtx, so just take it here too. Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> * Lock db_mtx in sa_byteswap() While SA code seems protected by sa_lock, there is a back door of dmu_objset_userquota_get_ids(), that may hold and access the dbuf without sa_lock, relying only on db_mtx. Taking db_mtx here should protect both the arc_release() and the data for db_buf. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18146	2026-01-26 21:32:16 -05:00
Alek P	cd895f0e57	remove thread unsafe debug code causing FreeBSD double free panic Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #18140	2026-01-21 10:00:34 -08:00
Alexander Moch	28291536bc	Zstd: Document update policy Add the Zstd update policy to the subtree README. Also update the documented location of zstd-in.c to match upstream changes, and normalize naming from 'ZSTD' to 'Zstd'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:24 -08:00
Alexander Moch	2d5a9b6a4c	Zstd: Restore SPDX license identifiers When updating Zstandard to version 1.5.7 the SPDX license identifiers were lost. This commit restores them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:18 -08:00
Alexander Moch	e7f9734bc7	Zstd: Fix ASan poisoning for pooled Zstd contexts The Zstd context mempool can reuse buffers that were previously poisoned under AddressSanitizer, leading to false-positive use-after-poison reports during zloop and other stress tests. Explicitly unpoison memory when handing buffers out to Zstd and poison the user-visible region again when buffers are returned to the pool. This makes the allocator ASan-correct while preserving existing pooling behavior. Also fix non-standard void * pointer arithmetic in zstd_free() and remove an early return in zstd_dctx_alloc() so kmem_type/kmem_size are always set on pool hits. This only affects ASan bookkeeping in user space, does not change runtime behavior in non-ASan configurations, and does not affect on-disk formats. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:12 -08:00
Alexander Moch	a2ac9cd606	Zstd: Integrate v1.5.7 into the ZFS build system This commit builds on the previous zstd library update and adds the necessary ZFS integration and build system changes required to make zstd 1.5.7 compile and function correctly. Changes: - Add zstd_preSplit.c (new in 1.5.7) to all build systems. - Enable x86_64 assembly in userspace (huf_decompress_amd64.S). - Disable assembly in kernel for RETHUNK/IBT compatibility. - Disable intrinsics in kernel for EL10 x86_64-v3 baseline. - Disable tracing in kernel builds for AArch64 compatibility. - Fix ZSTD_isError symbol renaming with __asm__ directive. - Rename abs64 to ZSTD_abs64 (FreeBSD kernel conflict). - Fix bitstream.h attributes (MEM_STATIC -> FORCE_INLINE_TEMPLATE). - Remove xxhash.c from BSD build (now header-only). - Update symbol names in zstd_compat_wrapper.h. - Ignore checkstyle for zstd-in.c. Kernel assembly disabled for security mitigation compatibility. User space retains full performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:06 -08:00
Alexander Moch	bbcddb127a	Zstd: Update bundled library to v1.5.7 without further adjustments This commit only replaces the bundled source and does not include any ZFS integration changes. Because the build depends on integration adjustments, it will fail until the accompanying integration commit is applied. Upstream release: https://github.com/facebook/zstd/releases/tag/v1.5.7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:40:37 -08:00

5190 Commits