mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Alexander Motin	3dcd071b51	Fix available space accounting for special/dedup (#18222 ) Currently, spa_dspace (base to calculate dataset AVAIL) only includes the normal allocation class capacity, but dd_used_bytes tracks space allocated across all classes. Since we don't want to report free space of other classes as available (we can't promise new allocations will be able to use it), report only allocated space, similar to how we report space saved by dedup and block cloning. Since we need deflated space here, make allocation classes track deflated allocated space also. While here, make mc_deferred also deflated, matching its use contexts. Also while there, use atomic_load() to read the allocation class stats. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18190 Closes #18222	2026-02-19 11:14:37 -08:00
Tony Hutter	46500a0803	CI: Test & fix Linux ZFS built-in build ZFS can be built directly into the Linux kernel. Add a test build of this to the CI to verify it works. The test build is only enabled on Fedora runners (since they run the newest kernels) and is done in parallel with ZTS. The test build is done on vm2, since it typically finishes ~15min before vm1 and thus has time to spare. In addition: - Update 'copy-builtin' to check that $1 is a directory - Fix some VERIFYs that were causing the built-in build to fail Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #18234	2026-02-19 11:14:37 -08:00
Attila Fülöp	c629e594e4	Linux 6.19 compat: in-tree build: fix duplicate GCM assembly functions Linux 6.19 added an AES-GCM VAES-AVX2 assembly implementation. It's basically a translation from the BoringSSL perlasm syntax to macro assembly. We're using the same source but the perlasm generated flat assembly which shares some global function names with the former. When building in-tree this results in the linker failing due to the duplicate symbols. To avoid the error we prepend `icp_` via a macro to our function names. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Moch <mail@alexmoch.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #18204 Closes #18224	2026-02-17 13:52:43 -08:00
rmacklem	f83a7864aa	zfs_vnops_os.c: Move a vput() to after zfs_setattr_dir() Without this patch, the following crash can occur when a file system is configured with "xattr=dir". VNASSERT failed: locked not true at /posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked) hold count flags () flags () lock type zfs: UNLOCKED panic: zfs_dirent_lookup: vnode is not locked but should be cpuid = 3 time = 1770520763 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b vpanic() at vpanic+0x136/frame 0xfffffe00914c8270 panic() at panic+0x43/frame 0xfffffe00914c82d0 assert_vop_locked() at assert_vop_locked+0x78 zfs_dirent_lookup() at zfs_dirent_lookup+0x41 zfs_setattr_dir() at zfs_setattr_dir+0x123 zfs_setattr() at zfs_setattr+0x1389 zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d setfown() at setfown+0xb1 kern_fchownat() at kern_fchownat+0x192 This patch fixes the problem by moving the vput() call for attrzp to after the zfs_setattr_dir() call that takes it as an argument. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes: #18188	2026-02-17 11:54:58 -08:00
Austin Wise	612d4019f1	Fix activating large_microzap on receive This ensures that the in-memory state of the feature is recorded and that `dsl_dataset_activate_feature` is not called when the feature is already active. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18143 Closes #18144	2026-02-17 11:54:58 -08:00
Alexander Motin	25327ed7ce	Improve caching for dbuf prefetches To avoid read errors with transaction open dmu_tx_check_ioerr() is used to read everything required in advance. But there seems to be a chance for the buffer to evicted from dbuf cache in between, which result in immediate eviction from ARC, which may require additional disk read later in a place where error handling is problematic. To partially workaround this introduce a new flag DMU_IS_PREFETCH, relayed to ARC as ARC_FLAG_PREFETCH \| ARC_FLAG_PRESCIENT_PREFETCH, making ARC delay eviction by at least several seconds, or till the actual read inside the transaction, that will promote it to demand access. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18160	2026-02-17 11:54:58 -08:00
Mariusz Zaborski	11647c669e	Flush RRD only when TXGs contain data This change modifies the behavior of spa_sync_time_logger when flushing the RRD database. Previously, once the sync interval elapsed, a flush would always be generated. On solid-state devices, especially when the pool was otherwise idle, this caused disks to wake up solely to write RRD data. Since RRD is best-effort telemetry, this behavior is unnecessary and wasteful. With this change, spa_sync_time_logger delays flushing until a TXG that already contains data is being synced. The RRD update is appended to that TXG instead of forcing the creation of a new write-only TXG. During pool export, flushing is forced regardless of whether the TXG contains user data. At that stage, data durability takes precedence and a write must be issued. Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #18082 Closes #18138	2026-02-11 11:41:13 -08:00
Marc Sladek	a0350f61c4	Fix `send:raw` permission for send `-w -I` When performing an incremental raw send with intermediates (-w -I), the standard 'send' permission was incorrectly required instead of allowing 'send:raw'. This was due to a strict boolean comparison on the 'rawok' flag in zfs_secpolicy_send() with non-boolean value. This change normalizes the 'rawok' variable to be strictly 0/1 and updates the test suite to properly verify delegated raw send behavior. Introduced-by: https://github.com/openzfs/zfs/pull/17543 Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Marc Sladek <marc@sladek.dev> Closes #18198 Closes #18193	2026-02-11 11:41:13 -08:00
Brian Behlendorf	618cfa02ea	ZTS: update the relevant mmp test cases - mmp_concurrent_import: added test case to verify that concurrent import correctness. The pool may only be imported once. - mmp_exported_import: an activity check is now required for pools which were cleanly exported if the system and pool hostids don't match. - mmp_inactive_import: an activity check is now required for any pool which wasn't cleanly exported, even if the system and pool hostids match. - mmp_on_uberblocks: updated expected uberblocks to take in to account the value MMP_INTERVAL_DEFAULT is set too. - mmp_reset_interval: reduce the number of iterations from 10 to 3. This is sufficient to verify functionality and significantly speeds up the test. - mmp_on_uberblocks: adjust the thresholds and increase the runtime to avoid false positives observed in CI. - Update tests to use 'zhack action idle' instead of ztest to improve the reliability of the tests. - Add additional log_note messages to test cases which have multiple verification steps to make it clear which portion of a test failed when reviewing the logs. - Replace default_setup/cleanup_noexit calls with 'zpool create' and 'zpool destroy' calls to avoid additional unnecessary dataset creation work. - Update activity/noactivity check helper functions to use the ZFS_LOAD_INFO_DEBUG information now available from 'zpool import' to determine if this activity check ran and why. This is more reliable in the CI than measuring the runtime. - Removed all mmp tests from the zts-report.py exceptions list. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brian Behlendorf	c710f87923	mmp: claim sequence id before final import As part of SPA_LOAD_IMPORT add an additional activity check to detect simultaneous imports from different hosts. This check is only required when the timing is such that there's no activity for the the read-only tryimport check to detect. This extra safety chceck operates as follows: 1. Repeats the following MMP check 10 times: a. Write out an MMP uberblock with the best txg and a random sequence id to all primary pool vdevs. b. Verify a minimum number of good writes such that even if the pool appears degraded on the remote host it will see at least one of the updated MMP uberblocks. c. Wait for the MMP interval this leaves a window for other racing hosts to make similar modifications which can be detected. d. Call vdev_uberblock_load() to determine the best uberblock to use, this should be the MMP uberblock just written. e. Verify the txg and random sequeunce number match the MMP uberblock written in 1a. 2. Restore the original MMP uberblocks. This allows the check to be performed again if the pool fails to import for an unrelated reason. This change also includes some refactoring and minor improvements. - Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that its either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADD which can result from a stale trusted config when doing a verbatim import. - Refactor the code for consistent logging of the multihost activity check using spa_load_note() and console messages indicating when the activity check was trigger and the result. - Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock. - Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to log to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result. - Standardize the mmp messages similarly to make it easier to find all the relevent mmp lines in the debug log. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brian Behlendorf	96ffe51004	mmp: add spa_load_name() for tryimport Tryimport adds a unique prefix to the pool name to avoid name collisions. This makes it awkward to log user-friendly info during a tryimport. Add a spa_load_name() function which can be used to report the unmodified pool name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brian Behlendorf	f2c40b4586	mmp: move "Starting import" log message Move the "Starting import" log message in to the import block so it's matched with the "Fiinshed importing" debug message. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brian Behlendorf	e78596e05e	mmp: further restrict mmp exported pool check For a cleanly exported pools there exists a small window where both systems may determine it's safe to import the pool and skip the activity check. Only allow the check to be skipped when the last imported hostid matches the systems hostid and the pool was cleanly exported. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brooks Davis	7c80abdd7c	nvpair: chase FreeBSD xdrproc_t definition As of FreeBSD 16, xdrproc_t will take exactly two arguments in both kernel and userspace in line with the Linux kernel. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alan Somers <asomers@freebsd.org> Signed-off-by: Brooks Davis <brooks@capabilitieslimited.co.uk> Closes #18154	2026-02-05 13:48:31 -08:00
Mariusz Zaborski	79c3810088	Make sure we can still write data to txg The final txgs are used only to clear out any remaining deferred frees, and we cannot write new data to them. Make sure we do not try to do so. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #18139	2026-02-05 13:48:31 -08:00
Alexander Motin	554a81b20a	Lock db_mtx around arc_release() in couple places * Lock db_mtx around arc_release() in dbuf_release_bp() While this function is called only in sync context, the same buffer can be touched by dbuf_hold_impl() in open context, creating races. All other accesses to arc_release() are already protected by db_mtx, so just take it here too. Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> * Lock db_mtx in sa_byteswap() While SA code seems protected by sa_lock, there is a back door of dmu_objset_userquota_get_ids(), that may hold and access the dbuf without sa_lock, relying only on db_mtx. Taking db_mtx here should protect both the arc_release() and the data for db_buf. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18146	2026-02-05 13:48:31 -08:00
Alek P	aebbfdb37a	remove thread unsafe debug code causing FreeBSD double free panic Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #18140	2026-02-05 13:48:31 -08:00
Mark Johnston	343cc96d7d	FreeBSD: Remove references to DEBUG_VFS_LOCKS This option is removed upstream in favour of plain INVARIANTS. VNASSERT is always defined so I see no reason to use it conditionally. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18136	2026-02-05 13:48:31 -08:00
Martin Matuška	d69f7c5e9b	FreeBSD: unbreak compilation on i386 tests/zfs-tests/cmd/mmap_seek.c: use correct printf specifier module/zfs/vdev.c: vdev_clear(): correctly cast argument to atomic_add_64(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #18096	2026-02-05 13:48:31 -08:00
Alan Somers	d08e561d0a	Fix --enable-invariants on FreeBSD The make symbols were never getting forwarded to the correct make subprocess. As far as I can tell, this has never worked. Either that, or something has changed in the behavior of make. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #18131	2026-02-05 13:48:31 -08:00
shuppy	6218a5eb03	Fix history logging for `zpool create -t` `zpool create` is supposed to log the command to the new pool’s history, as a special record that never gets evicted from the ring buffer. but when you create a pool with `zpool create -t`, no such record is ever logged (#18102). that bug may be the cause of issues like #16408. `zpool create -t` (`83e9986f6e`) and `zpool import -t` (`26b42f3f9d`) are both designed to override the on-disk zpool property `name` with an in-core “temporary” name, but they work somewhat differently under the hood. importing with a temporary name sets `spa->spa_import_flags \|= ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells spa_write_cachefile() and spa_config_generate() to use the ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`. creating with a temporary name permanently(!) sets the internal zpool property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create() (`4ceb8dd6fd`) and spa_create() to use that name instead of `zc->zc_name`, then sets `spa->spa_import_flags \|= ZFS_IMPORT_TEMP_NAME` like an import. but zfsdev_ioctl_common() fails to check for `tname` when saving the pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY, we call spa_open() on the wrong pool name and get ENOENT, so the logging silently fails. this patch fixes #18102 by checking for `tname` in zfsdev_ioctl_common() like we do in zfs_ioc_pool_create(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: delan azabani <dazabani@igalia.com> Closes #18118 Closes #18102	2026-02-05 13:48:31 -08:00
Alexander Motin	2c9fec38d0	DDT: Add locking for table ZAP destruction Similar to BRT, DDT ZAP can be destroyed by sync context when it becomes empty. Respectively similar to BRT introduce RW-lock to protect open context methods from the destruction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18115	2026-02-05 13:48:31 -08:00
Andrew Walker	6f7f71825f	Add fh_to_parent export definition This commit adds support for converting a file handle to its parent dentry. This is called in exportfs_decode_fh_raw() when subtree checking is enabled in NFS. Defining this and handling the expanded filehandles allows the knfsd to succeed in handling the file handle where it might otherwise fail with ESTALE when trying to open by filehandle. A side effect of this change is that name_to_handle_at(2) and open_by_handle_at(2) now support AT_HANDLE_CONNECTABLE. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Andrew Walker <andrew.walker@truenas.com> Closes #18099	2026-02-05 13:48:31 -08:00
Rob Norris	42c2b2d774	spl: remove a _KERNEL check This code is only compiled for the Linux kernel module, so that define is always set. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-02-05 13:48:31 -08:00
Rob Norris	26fcf5848b	spl: unexport kstat_proc_entry functions These are used to implement the kstat and procfs_list interfaces, and aren't used from outside. There's no need to export them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-02-05 13:48:31 -08:00
Rob Norris	eaa645be5d	spl: lift 64-bit math compat out to separate file It's a lot of rarely-compiled code, so move it to the side to make other code easier to read. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-02-05 13:48:31 -08:00
Rob Norris	b1d3b5e567	spl: remove old atomic lock Long ago, SPL atomics were implemented as a global spinlock over conventional operations. In `5e9b5d832b` (2009-10) they was converted to proper atomics, with the spinlock retained as a fallback. The switch to compile with the fallback was later removed in `a91258913f` (2018-05), but the code it enabled wasn't. So lets do that. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-02-05 13:48:31 -08:00
Dimitry Andric	4cc3056c56	icp: emit .note.GNU-stack section for all ELF targets On FreeBSD, linking the zfs kernel module with binutils ld 2.44 shows the following warning: ld: warning: aesni-gcm-avx2-vaes.o: missing .note.GNU-stack section implies executable stack ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker Some of the `.S` files under `module/icp/asm-x86_64/modes` check whether to emit the `.note.GNU-stack` section using: #if defined(__linux__) && defined(__ELF__) We could add `&& defined(__FreeBSD__)` to the test, but since all other `.S` files in the OpenZFS tree use: #ifdef __ELF__ it would seem more logical to use that instead. Any recent ELF platform should support these note sections by now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #18119	2026-02-05 13:48:31 -08:00
Austin Wise	65e13c33d8	When receiving a stream with the large block flag, activate feature ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS to indicate the presence of large blocks in the dataset. On the sending side, this flag is included if the `-L` flag is passed to `zfs send` and the feature is active in the dataset. On the receive side, the stream is refused if the feature is active in the destination dataset but the stream does not include the feature flag. The problem is the feature is only activated when a large block is born. If a large block has been born in the destination, but never the source, the send can't work. This can arise when sending streams back and forth between two datasets. This commit fixes the problem by always activating the large blocks feature when receiving a stream with the large block feature flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18105	2026-02-05 13:48:31 -08:00
Jitendra Patidar	8a826c0f68	Fix zfs_open() to skip zil_async_to_sync() for the snapshot Fix zfs_open() to skip zil_async_to_sync() for the snapshot, as it won't have any transactions. zfsvfs->z_log is NULL for the snapshot. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #18091	2026-02-05 13:48:31 -08:00
Andrew Walker	edd3d33433	Add handling for STATX_CHANGE_COOKIE This commit adds handling for the STATX_CHANGE_COOKIE so that we can properly surface the ZFS znode sequence to NFS clients via knfsd. If knfsd does not have STATX_CHANGE_COOKIE in statx result then it will synthesize the NFS change_info4 structure and related change4id values algorithmically based on the ctime value of the file. Since internally ZFS is using ktime_get_coarse_real_ts64() for the timestamp calculation here it introduces the possiblity that the change will not increment the change4id of directories / files causing a failure in the client to invalidate its attr cache (among other things). See RFC 8881 Section 10.8 for discussion of how clients may implement name and directory caching. Notable in this commit is that we are not initializing the inode->i_version to the znode->z_seq number. The reason for this is that we're intentionally not setting `SB_I_VERSION`. This indicates that the filesystem manages its own i_version and so it is not populated in the generic_fillattr. The following compares tight loop of setattr over NFSv4 protocol while traching nfsd4_change_attribute. Before change: inode, change_attribute 4723, 7590032215978780890 4723, 7590032215978780890 4723, 7590032215978780890 4723, 7590032215982780865 4723, 7590032215982780865 After change: inode, change_attribute 7602, 7590032992517123951 7602, 7590032992517123952 7602, 7590032992517123953 7602, 7590032992517123954 7602, 7590032992517123955 Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Andrew Walker <andrew.walker@truenas.com> Closes #18097	2026-02-05 13:48:31 -08:00
Rob Norris	a5d9f233fa	kmem: don't add __GFP_RECLAIMABLE for KM_VMEM allocations vmalloc()'d memory is not movable/reclaimable, so __GFP_RECLAIMABLE is not a valid flag, and since 6.19 the kernel warns if you use it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18107	2026-02-05 13:48:31 -08:00
Rob Norris	cb1833023f	kmem: don't add __GFP_COMP for KM_VMEM allocations It hasn't been necessary since Linux 3.13 (torvalds/linux@a57a49887e), and since 6.19 the kernel warns if you use it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2026-02-05 13:48:31 -08:00
Rob Norris	2422c1f3b9	kmem: don't pass __GFP_HIGHMEM to __vmalloc Since Linux 4.12 (torvalds/linux@19809c2da2) __GFP_HIGHMEM has been automatically added to calls to __vmalloc() internally, so we don't need it anymore. This is good, because since 6.19 the kernel warns if you use __GFP_HIGHMEM. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2026-02-05 13:48:31 -08:00
Rob Norris	ccf956c2b3	Linux 6.19: replace i_state access with inode_state_read_once() Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2026-02-05 13:48:31 -08:00
Alexander Motin	09587c7385	Use reduced precision for scan times Scan time limits do not need precision beyond 1ms. Switching scn_sync_start_time and spa_sync_starttime from gethrtime() to getlrtime() saves ~3% of CPU time during resilver scan stage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18061	2026-02-05 13:48:31 -08:00
Alexander Motin	35ee242abc	Reduce minimal scrub/resilver times With higher throughput and lower latency of modern devices ZFS can happily live with pretty short (fractions of a second) TXGs. But the two decade old multi-second minimal time limits can almost stop payload writes by extending TXGs beyond dirty data limits of ARC ability to amortize it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18060	2026-02-05 13:48:31 -08:00
Mark Maybee	0de2da6a37	Fix rangelock test for growing block size If the file already has more than one block, then the current block size cannot change. But if the file block size is less than the maximum block size supported by the file system, and there are multiple blocks in the file, the current code will almost always extend the rangelock to its maximum size. This means that all writes become serialized and even reads are slowed as they will more often contend with writes. This commit adjusts the test so that we will not lock the entire range if there is more than one block in the file already. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Maybee <mark.maybee@perforce.com> Closes #18046 Closes #18064	2026-02-05 13:48:31 -08:00
Alexander Motin	8d391531eb	Bypass snprintf() in quota checks if no quotas set This improves synthetic 1 byte write speed by ~2.5%. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18063	2026-02-05 13:48:31 -08:00
Alexander Motin	8dd01181aa	RAIDZ: Remove some excessive logging There were some per I/O logging into dbgmsg in RAIDZ code, that increased CPU load and wiped useful content out of dbgmsg, for example during routine disk replacement process. I don't think we need it to be that verbose. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18059	2026-02-05 13:48:31 -08:00
Alexander Motin	96b1d2fae9	DDT: Fix compressed entry buffer size The first byte of the entry after compression is used for algorithm and byte order flag. We should decrement when calling compression/ decompression algorithm. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18055	2026-02-05 13:48:31 -08:00
Alexander Motin	4ab2027f59	DDT: Add/use zap_lookup_length_uint64_by_dnode() Unlike other ZAP consumers due to compression DDT does not know how big entry it is reading from ZAP. Due to this it called zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(), each of which does full ZAP entry lookup. Introduction of the combined ZAP method dramatically reduces the CPU overhead and locks contention at DBUF layer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18048	2026-02-05 13:48:31 -08:00
Alexander Motin	4905686e67	DDT: Switch to using ZAP _by_dnode() interfaces As was previously done for BRT, avoid holding/releasing DDT ZAP dnodes for every access. Instead hold the dnodes during all their life time, never releasing. While at this, add _by_dnode() interfaces for zap_length_uint64() and zap_count(), actively used by DDT code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18047	2026-02-05 13:48:31 -08:00
Alexander Motin	fa857113a3	DDT: Move logs searches out of the lock Postponing entry removal from the DDT log in case of hit till later single-threaded sync stage allows to make ddl_tree stable during multi-threaded ZIO processing stage. It allows to drop the DDT lock before the search instead of after, reducing the contention a lot. Actually ddt_log_update_entry() was already handling the case of entry present in the active log, so we only need to remove it from flushing log, if the entry happen to be there. My tests with parallel 4KB block writes show throughput increase from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even though still limited by the global DDT lock contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18044	2026-02-05 13:48:31 -08:00
Alexander Motin	2428043709	Improve async destroy processing timing Previous code effectively enforced that all async free ZIOs were _issued_ within the TXG timeout. But they could take forever to complete, especially if the required metadata were not in ARC. This patch introduces periodic waits every 2000 ZIOs, which should give at least somewhat reasonable TXG timings even for single HDD pools with empty ARC. And makes them complete within half of the TXG timeout, since we might still need time to sync DDT and BRT. While there, change zfs_max_async_dedup_frees semantics to include also clone and gang blocks, which are similar. Bump the default value from set long ago to be more forgiving to block cloning (still not having logs and benefiting from large TXGs), now that we have better working time limits. The limit now is a possible amount of dirty data produced by BRT updates. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18043	2026-02-05 13:48:31 -08:00
Alexander Motin	135103a648	Defer async destroys on pool import We've observed a number of cases when pool import stuck for many minutes due to large async destroy trying to load DDT or BRT from HDD pool. While proper destroy dosage is a separate problem, lets give import process a chance to complete before that at all. It may be not enough if there is a lot of ZIL to replay, but that is harder to cover, since those are in separate syscalls. Code investigation shown that we already have this mechanism used for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS into a tunable and applies it to async destroys also. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18033	2026-02-05 13:48:30 -08:00
Alexander Motin	8a79d09680	ARC: Increase parallel eviction batching Before parallel eviction implementation zfs_arc_evict_batch_limit caused loop exits after evicting 10 headers. The cost of it is not big and well motivated. Now though taskq task exit after the same 10 headers is much more expensive. To cover the context switch overhead of taskq introduce another level of batching, controlled by zfs_arc_evict_batches_limit tunable, used only for parallel eviction. My tests including 36 parallel reads with 4KB recordsize that shown 1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention, now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17970	2026-02-05 13:48:30 -08:00
Alexander Motin	5e0f20088d	ARC: Pre-convert zfs_arc_min_prefetch_ms There is no need to do MSEC_TO_TICK() for each evicted ARC header. We can do it when tunables are set, since we already have separate internal variables for those. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17965	2026-02-05 13:48:30 -08:00
Alexander Motin	6482a27e81	Reduce dataset buffers re-dirtying For each block written or freed ZFS dirties ds_dbuf of the dataset. While dbuf_dirty() has a fast path for already dirty dbufs, it still require taking the lock and doing some things visible in profiler. Investigation shown ds_dbuf dirtying by dsl_dataset_block_born() and some of dsl_dataset_block_kill() are just not needed, since by the time they are called in sync context the ds_dbuf is already dirtied by dsl_dataset_sync(). Tests show this reducing large file deletion time by ~3% by saving CPU time of single-threaded part of the sync thread. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18028	2026-02-05 13:48:30 -08:00
Ameer Hamza	0bcbee6040	Fix snapshot automount race causing duplicate mounts and AVL tree panic Multiple threads racing to automount the same snapshot can both spawn mount helper processes that successfully complete, causing both parent threads to attempt AVL tree registration and triggering a VERIFY() panic in avl_add(). This occurs because the fsconfig/fsmount API lacks the serialization provided by traditional mount() via lock_mount(). The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that serializes mount and unmount operations on the same snapshot. The first mount thread creates a pending entry with se_spa=NULL and holds se_mtx during the helper execution. Concurrent mounts find the pending entry and return success without spawning duplicate helpers. Unmount waits on se_mtx if a mount is pending, ensuring proper serialization. This allows different snapshots to mount in parallel while preventing the AVL panic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17943	2025-12-10 10:21:29 -08:00

1 2 3 4 5 ...

5150 Commits