mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-13 07:01:46 +03:00

Author	SHA1	Message	Date
Ameer Hamza	00d69b0f72	arc: remove unused l2df_size and l2df_type from l2arc_data_free_t These fields became unused when ABD was introduced in `a6255b7fc`. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:26 -08:00
Ameer Hamza	d1f290f1ea	L2ARC: Implement DWPD-based rate limiting with adaptive feed intervals Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write speeds and protect SSD endurance. Write rate is constrained by the minimum of l2arc_write_max and DWPD-calculated budget. Devices accumulate unused write budget over 24-hour periods with automatic reset and carry-over. Writes occur in controlled bursts (max 50MB) with adaptive intervals to achieve target rates. Applies after initial device fill. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:07 -08:00
Ameer Hamza	b525525b44	L2ARC: Implement per-device feed threads for parallel writes Transform L2ARC from single global feed thread to per-device threads, enabling parallel writes to multiple L2ARC devices. Each device runs its own feed thread independently, improving multi-device throughput. Previously, a single thread served all devices sequentially; now each device writes concurrently. Threads are created during device addition and torn down on removal. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:07:02 -08:00
Ameer Hamza	825dc41ad4	L2ARC: Preserve L2HDR in arc_release() for in-flight writes When arc_release() is called on a header with a single buffer and L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar to the arc_hdr_destroy() case). If we destroy the L2HDR here, later arc_write() will allocate a new ABD and call arc_hdr_free_abd(), which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing VERIFY(HDR_HAS_L2HDR(hdr)) to fail. Allocate a new header for the buffer in the single_buf_l2writing case (single buffer + L2_WRITING), leaving the original header with L2HDR intact. The original header becomes an "orphan" (no buffers, no b_pabd) but retains device association for ABD cleanup when l2arc_write_done() completes. The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC makes its own transformed copy via l2arc_apply_transforms(), so the original ABD is not used by the L2 write. The header can be safely reused without allocating a new one. For proper evictable space accounting, arc_buf_remove() must be called before remove_reference() in the single_buf_l2writing path. This ensures arc_evictable_space_increment() (during remove_reference) and arc_evictable_space_decrement() (during destruction) see the same state (b_buf=NULL), preventing accounting leaks that cause module unload to hang with non-zero esize. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:57 -08:00
Ameer Hamza	b8610c3d93	L2ARC: Reorder header destruction for in-flight L2 writes With multiple L2ARC devices, headers can be destroyed asynchronously (e.g., during zpool sync) while L2_WRITING is set. The original code destroyed L2HDR before L1HDR, causing ABDs to lose their device association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called. This caused ABDs to be added to the global free-on-write list without device information. When any L2ARC device completed its write and attempted to free these orphaned ABDs, it would panic on ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was still part of another device's vdev_queue I/O aggregation gang. Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag ABDs with their device for deferred cleanup. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:51 -08:00
Ameer Hamza	2f41b9d865	L2ARC: Implement persistent markers with consistent tail scanning This commit introduces per-sublist persistent markers that eliminate redundant tail scanning between L2ARC iterations, providing significant CPU efficiency improvements. Markers are pre-allocated during device initialization and properly cleaned up during device removal. The implementation uses conditional behavior based on device capacity: small devices (capacity < arc_c) retain original HEAD/TAIL scanning based on ARC warmup state, while large devices (capacity >= arc_c) use the persistent marker approach for optimal CPU efficiency. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:47 -08:00
Ameer Hamza	3523b5f3f9	L2ARC: Implement even-depth multi-sublist scanning The introduction of ARC multilists made L2ARC writing quite random, depending on whether it found something to write in a randomly selected sublist. This created inconsistent write patterns and poor utilization of available sublists leading to uneven cache population. This commit replaces random selection with systematic scanning across all sublists within each burst. Fair headroom distribution ensures even-depth traversal across all sublists until the target write size is reached. Round-robin processing with random starting points eliminates sequential bias while maintaining predictable write behavior. The systematic approach provides consistent L2ARC filling patterns and better utilization of available ARC data across all sublists. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:05:53 -08:00
Brooks Davis	b364720524	nvpair: chase FreeBSD xdrproc_t definition As of FreeBSD 16, xdrproc_t will take exactly two arguments in both kernel and userspace in line with the Linux kernel. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alan Somers <asomers@freebsd.org> Signed-off-by: Brooks Davis <brooks@capabilitieslimited.co.uk> Closes #18154	2026-01-28 21:41:33 -05:00
Mariusz Zaborski	a157ef62a1	Make sure we can still write data to txg The final txgs are used only to clear out any remaining deferred frees, and we cannot write new data to them. Make sure we do not try to do so. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #18139	2026-01-26 21:33:21 -05:00
Alexander Motin	35b2d39709	Lock db_mtx around arc_release() in couple places * Lock db_mtx around arc_release() in dbuf_release_bp() While this function is called only in sync context, the same buffer can be touched by dbuf_hold_impl() in open context, creating races. All other accesses to arc_release() are already protected by db_mtx, so just take it here too. Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> * Lock db_mtx in sa_byteswap() While SA code seems protected by sa_lock, there is a back door of dmu_objset_userquota_get_ids(), that may hold and access the dbuf without sa_lock, relying only on db_mtx. Taking db_mtx here should protect both the arc_release() and the data for db_buf. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18146	2026-01-26 21:32:16 -05:00
Alek P	cd895f0e57	remove thread unsafe debug code causing FreeBSD double free panic Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #18140	2026-01-21 10:00:34 -08:00
Alexander Moch	28291536bc	Zstd: Document update policy Add the Zstd update policy to the subtree README. Also update the documented location of zstd-in.c to match upstream changes, and normalize naming from 'ZSTD' to 'Zstd'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:24 -08:00
Alexander Moch	2d5a9b6a4c	Zstd: Restore SPDX license identifiers When updating Zstandard to version 1.5.7 the SPDX license identifiers were lost. This commit restores them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:18 -08:00
Alexander Moch	e7f9734bc7	Zstd: Fix ASan poisoning for pooled Zstd contexts The Zstd context mempool can reuse buffers that were previously poisoned under AddressSanitizer, leading to false-positive use-after-poison reports during zloop and other stress tests. Explicitly unpoison memory when handing buffers out to Zstd and poison the user-visible region again when buffers are returned to the pool. This makes the allocator ASan-correct while preserving existing pooling behavior. Also fix non-standard void * pointer arithmetic in zstd_free() and remove an early return in zstd_dctx_alloc() so kmem_type/kmem_size are always set on pool hits. This only affects ASan bookkeeping in user space, does not change runtime behavior in non-ASan configurations, and does not affect on-disk formats. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:12 -08:00
Alexander Moch	a2ac9cd606	Zstd: Integrate v1.5.7 into the ZFS build system This commit builds on the previous zstd library update and adds the necessary ZFS integration and build system changes required to make zstd 1.5.7 compile and function correctly. Changes: - Add zstd_preSplit.c (new in 1.5.7) to all build systems. - Enable x86_64 assembly in userspace (huf_decompress_amd64.S). - Disable assembly in kernel for RETHUNK/IBT compatibility. - Disable intrinsics in kernel for EL10 x86_64-v3 baseline. - Disable tracing in kernel builds for AArch64 compatibility. - Fix ZSTD_isError symbol renaming with __asm__ directive. - Rename abs64 to ZSTD_abs64 (FreeBSD kernel conflict). - Fix bitstream.h attributes (MEM_STATIC -> FORCE_INLINE_TEMPLATE). - Remove xxhash.c from BSD build (now header-only). - Update symbol names in zstd_compat_wrapper.h. - Ignore checkstyle for zstd-in.c. Kernel assembly disabled for security mitigation compatibility. User space retains full performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:41:06 -08:00
Alexander Moch	bbcddb127a	Zstd: Update bundled library to v1.5.7 without further adjustments This commit only replaces the bundled source and does not include any ZFS integration changes. Because the build depends on integration adjustments, it will fail until the accompanying integration commit is applied. Upstream release: https://github.com/facebook/zstd/releases/tag/v1.5.7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Moch <mail@alexmoch.com> Closes #18089	2026-01-20 13:40:37 -08:00
Mark Johnston	54b141fab5	FreeBSD: Remove references to DEBUG_VFS_LOCKS This option is removed upstream in favour of plain INVARIANTS. VNASSERT is always defined so I see no reason to use it conditionally. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18136	2026-01-19 08:55:17 -08:00
Martin Matuška	8605bdfdda	FreeBSD: unbreak compilation on i386 tests/zfs-tests/cmd/mmap_seek.c: use correct printf specifier module/zfs/vdev.c: vdev_clear(): correctly cast argument to atomic_add_64(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #18096	2026-01-14 17:02:41 -08:00
Alan Somers	3fffe4e707	Fix --enable-invariants on FreeBSD The make symbols were never getting forwarded to the correct make subprocess. As far as I can tell, this has never worked. Either that, or something has changed in the behavior of make. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #18131	2026-01-14 14:54:12 -08:00
shuppy	09e4e01e93	Fix history logging for `zpool create -t` `zpool create` is supposed to log the command to the new pool’s history, as a special record that never gets evicted from the ring buffer. but when you create a pool with `zpool create -t`, no such record is ever logged (#18102). that bug may be the cause of issues like #16408. `zpool create -t` (`83e9986f6e`) and `zpool import -t` (`26b42f3f9d`) are both designed to override the on-disk zpool property `name` with an in-core “temporary” name, but they work somewhat differently under the hood. importing with a temporary name sets `spa->spa_import_flags \|= ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells spa_write_cachefile() and spa_config_generate() to use the ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`. creating with a temporary name permanently(!) sets the internal zpool property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create() (`4ceb8dd6fd`) and spa_create() to use that name instead of `zc->zc_name`, then sets `spa->spa_import_flags \|= ZFS_IMPORT_TEMP_NAME` like an import. but zfsdev_ioctl_common() fails to check for `tname` when saving the pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY, we call spa_open() on the wrong pool name and get ENOENT, so the logging silently fails. this patch fixes #18102 by checking for `tname` in zfsdev_ioctl_common() like we do in zfs_ioc_pool_create(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: delan azabani <dazabani@igalia.com> Closes #18118 Closes #18102	2026-01-14 14:51:51 -08:00
Alexander Motin	765929cb4e	DDT: Add locking for table ZAP destruction Similar to BRT, DDT ZAP can be destroyed by sync context when it becomes empty. Respectively similar to BRT introduce RW-lock to protect open context methods from the destruction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18115	2026-01-13 15:07:15 -08:00
Andrew Walker	aca58dbb65	Add fh_to_parent export definition This commit adds support for converting a file handle to its parent dentry. This is called in exportfs_decode_fh_raw() when subtree checking is enabled in NFS. Defining this and handling the expanded filehandles allows the knfsd to succeed in handling the file handle where it might otherwise fail with ESTALE when trying to open by filehandle. A side effect of this change is that name_to_handle_at(2) and open_by_handle_at(2) now support AT_HANDLE_CONNECTABLE. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Andrew Walker <andrew.walker@truenas.com> Closes #18099	2026-01-08 15:06:12 -08:00
Rob Norris	f2b4ed3fe5	spl: remove a _KERNEL check This code is only compiled for the Linux kernel module, so that define is always set. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-01-08 10:33:44 -08:00
Rob Norris	02a631139f	spl: unexport kstat_proc_entry functions These are used to implement the kstat and procfs_list interfaces, and aren't used from outside. There's no need to export them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-01-08 10:33:37 -08:00
Rob Norris	662f33f323	spl: lift 64-bit math compat out to separate file It's a lot of rarely-compiled code, so move it to the side to make other code easier to read. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-01-08 10:33:32 -08:00
Rob Norris	2ca6e880da	spl: remove old atomic lock Long ago, SPL atomics were implemented as a global spinlock over conventional operations. In `5e9b5d832b` (2009-10) they were converted to proper atomics, with the spinlock retained as a fallback. The switch to compile with the fallback was later removed in `a91258913f` (2018-05), but the code it enabled wasn't. So let's do that. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18117	2026-01-08 10:33:14 -08:00
Dimitry Andric	2f1f25217f	icp: emit .note.GNU-stack section for all ELF targets On FreeBSD, linking the zfs kernel module with binutils ld 2.44 shows the following warning: ld: warning: aesni-gcm-avx2-vaes.o: missing .note.GNU-stack section implies executable stack ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker Some of the `.S` files under `module/icp/asm-x86_64/modes` check whether to emit the `.note.GNU-stack` section using: #if defined(__linux__) && defined(__ELF__) We could add `&& defined(__FreeBSD__)` to the test, but since all other `.S` files in the OpenZFS tree use: #ifdef __ELF__ it would seem more logical to use that instead. Any recent ELF platform should support these note sections by now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #18119	2026-01-08 09:21:12 -08:00
Austin Wise	794f1587db	When receiving a stream with the large block flag, activate feature ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS to indicate the presence of large blocks in the dataset. On the sending side, this flag is included if the `-L` flag is passed to `zfs send` and the feature is active in the dataset. On the receive side, the stream is refused if the feature is active in the destination dataset but the stream does not include the feature flag. The problem is the feature is only activated when a large block is born. If a large block has been born in the destination, but never the source, the send can't work. This can arise when sending streams back and forth between two datasets. This commit fixes the problem by always activating the large blocks feature when receiving a stream with the large block feature flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Austin Wise <AustinWise@gmail.com> Closes #18105	2026-01-07 16:47:12 -08:00
Jitendra Patidar	2301755dfb	Fix zfs_open() to skip zil_async_to_sync() for the snapshot Fix zfs_open() to skip zil_async_to_sync() for the snapshot, as it won't have any transactions. zfsvfs->z_log is NULL for the snapshot. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #18091	2026-01-06 10:58:56 -08:00
Wolfgang Hoschek	c77f17b750	Add snapshots_changed_nsecs dataset property Add a read-only dataset property, snapshots_changed_nsecs, which exposes the nanosecond resolution version of snapshots_changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Wolfgang Hoschek <wolfgang.hoschek@mac.com> Closes #17998 Closes #18031	2026-01-06 09:36:20 -08:00
Andrew Walker	312bdab0f5	Add handling for STATX_CHANGE_COOKIE This commit adds handling for the STATX_CHANGE_COOKIE so that we can properly surface the ZFS znode sequence to NFS clients via knfsd. If knfsd does not have STATX_CHANGE_COOKIE in statx result then it will synthesize the NFS change_info4 structure and related change4id values algorithmically based on the ctime value of the file. Since internally ZFS is using ktime_get_coarse_real_ts64() for the timestamp calculation here it introduces the possibility that the change will not increment the change4id of directories / files causing a failure in the client to invalidate its attr cache (among other things). See RFC 8881 Section 10.8 for discussion of how clients may implement name and directory caching. Notable in this commit is that we are not initializing the inode->i_version to the znode->z_seq number. The reason for this is that we're intentionally not setting `SB_I_VERSION`. This indicates that the filesystem manages its own i_version and so it is not populated in the generic_fillattr. The following compares tight loop of setattr over NFSv4 protocol while tracking nfsd4_change_attribute. Before change: inode, change_attribute 4723, 7590032215978780890 4723, 7590032215978780890 4723, 7590032215978780890 4723, 7590032215982780865 4723, 7590032215982780865 After change: inode, change_attribute 7602, 7590032992517123951 7602, 7590032992517123952 7602, 7590032992517123953 7602, 7590032992517123954 7602, 7590032992517123955 Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Andrew Walker <andrew.walker@truenas.com> Closes #18097	2026-01-05 14:06:28 -08:00
Rob Norris	a1319bf654	kmem: don't add __GFP_RECLAIMABLE for KM_VMEM allocations vmalloc()'d memory is not movable/reclaimable, so __GFP_RECLAIMABLE is not a valid flag, and since 6.19 the kernel warns if you use it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18107	2026-01-05 13:35:13 -08:00
Rob Norris	f041375b52	kmem: don't add __GFP_COMP for KM_VMEM allocations It hasn't been necessary since Linux 3.13 (torvalds/linux@a57a49887e), and since 6.19 the kernel warns if you use it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2025-12-23 12:54:34 -08:00
Rob Norris	f95e306266	kmem: don't pass __GFP_HIGHMEM to __vmalloc Since Linux 4.12 (torvalds/linux@19809c2da2) __GFP_HIGHMEM has been automatically added to calls to __vmalloc() internally, so we don't need it anymore. This is good, because since 6.19 the kernel warns if you use __GFP_HIGHMEM. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2025-12-23 12:54:11 -08:00
Rob Norris	3c8665cb5d	Linux 6.19: replace i_state access with inode_state_read_once() Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2025-12-23 12:53:32 -08:00
Ivan Shapovalov	9880ac3080	zvol: cosmetic: fix up `volthreading` property short name Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>	2025-12-23 11:12:21 -08:00
Rob Norris	654e7628d6	u8_textprep: move into module/zfs Now that it's built into the main zfs module in all cases, there's no reason to put it in its own dir. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18071	2025-12-22 14:58:36 -08:00
Alexander Motin	962e68865e	Use reduced precision for scan times Scan time limits do not need precision beyond 1ms. Switching scn_sync_start_time and spa_sync_starttime from gethrtime() to getlrtime() saves ~3% of CPU time during resilver scan stage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18061	2025-12-18 10:22:11 -08:00
Alexander Motin	a83bb15fcd	Reduce minimal scrub/resilver times With higher throughput and lower latency of modern devices ZFS can happily live with pretty short (fractions of a second) TXGs. But the two decade old multi-second minimal time limits can almost stop payload writes by extending TXGs beyond dirty data limits of ARC ability to amortize it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18060	2025-12-18 10:21:45 -08:00
Mark Maybee	7ff329ac2e	Fix rangelock test for growing block size If the file already has more than one block, then the current block size cannot change. But if the file block size is less than the maximum block size supported by the file system, and there are multiple blocks in the file, the current code will almost always extend the rangelock to its maximum size. This means that all writes become serialized and even reads are slowed as they will more often contend with writes. This commit adjusts the test so that we will not lock the entire range if there is more than one block in the file already. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Maybee <mark.maybee@perforce.com> Closes #18046 Closes #18064	2025-12-18 09:23:38 -08:00
Alexander Motin	051a8c7494	Bypass snprintf() in quota checks if no quotas set This improves synthetic 1 byte write speed by ~2.5%. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18063	2025-12-17 21:59:47 -05:00
Alexander Motin	0550abd4b8	RAIDZ: Remove some excessive logging There were some per I/O logging into dbgmsg in RAIDZ code, that increased CPU load and wiped useful content out of dbgmsg, for example during routine disk replacement process. I don't think we need it to be that verbose. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18059	2025-12-17 14:00:01 -08:00
Alexander Motin	22e89aca88	DDT: Fix compressed entry buffer size The first byte of the entry after compression is used for algorithm and byte order flag. We should decrement when calling compression/ decompression algorithm. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18055	2025-12-15 14:52:44 -08:00
Alexander Motin	3b1ff816bd	DDT: Add/use zap_lookup_length_uint64_by_dnode() Unlike other ZAP consumers due to compression DDT does not know how big entry it is reading from ZAP. Due to this it called zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(), each of which does full ZAP entry lookup. Introduction of the combined ZAP method dramatically reduces the CPU overhead and locks contention at DBUF layer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18048	2025-12-15 14:38:34 -08:00
Alexander Motin	ff5414406f	DDT: Switch to using ZAP _by_dnode() interfaces As was previously done for BRT, avoid holding/releasing DDT ZAP dnodes for every access. Instead hold the dnodes during all their life time, never releasing. While at this, add _by_dnode() interfaces for zap_length_uint64() and zap_count(), actively used by DDT code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18047	2025-12-15 09:49:14 -08:00
Alexander Motin	46d6f1fe56	DDT: Move logs searches out of the lock Postponing entry removal from the DDT log in case of hit till later single-threaded sync stage allows to make ddl_tree stable during multi-threaded ZIO processing stage. It allows to drop the DDT lock before the search instead of after, reducing the contention a lot. Actually ddt_log_update_entry() was already handling the case of entry present in the active log, so we only need to remove it from flushing log, if the entry happens to be there. My tests with parallel 4KB block writes show throughput increase from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even though still limited by the global DDT lock contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18044	2025-12-15 09:17:04 -08:00
Alexander Motin	3d76ba2737	Improve async destroy processing timing Previous code effectively enforced that all async free ZIOs were _issued_ within the TXG timeout. But they could take forever to complete, especially if the required metadata were not in ARC. This patch introduces periodic waits every 2000 ZIOs, which should give at least somewhat reasonable TXG timings even for single HDD pools with empty ARC. And makes them complete within half of the TXG timeout, since we might still need time to sync DDT and BRT. While there, change zfs_max_async_dedup_frees semantics to include also clone and gang blocks, which are similar. Bump the default value from set long ago to be more forgiving to block cloning (still not having logs and benefiting from large TXGs), now that we have better working time limits. The limit now is a possible amount of dirty data produced by BRT updates. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18043	2025-12-11 18:46:08 -08:00
Alexander Motin	f72fd378c8	Defer async destroys on pool import We've observed a number of cases when pool import got stuck for many minutes due to large async destroy trying to load DDT or BRT from HDD pool. While proper destroy dosage is a separate problem, let's give import process a chance to complete before that at all. It may be not enough if there is a lot of ZIL to replay, but that is harder to cover, since those are in separate syscalls. Code investigation showed that we already have this mechanism used for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS into a tunable and applies it to async destroys also. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18033	2025-12-11 18:44:46 -08:00
Alexander Motin	d393166c54	ARC: Increase parallel eviction batching Before parallel eviction implementation zfs_arc_evict_batch_limit caused loop exits after evicting 10 headers. The cost of it is not big and well motivated. Now though taskq task exit after the same 10 headers is much more expensive. To cover the context switch overhead of taskq introduce another level of batching, controlled by zfs_arc_evict_batches_limit tunable, used only for parallel eviction. My tests including 36 parallel reads with 4KB recordsize that showed 1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention, now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17970	2025-12-10 13:03:01 -08:00
Rob Norris	9fdb854109	Linux: work around use of GPL-only symbol `kasan_flag_enabled` We may not be able to avoid our code referencing the symbol, but we can ensure that a symbol of that name is available to the linker during build, and so not require linking the GPL-exported version. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18009 Closes #18040	2025-12-10 10:04:57 -08:00
Chunwei Chen	0c194352b5	Fix ddtprune causing space leak In zio_ddt_free, if a pruned dde is still in ddt, it would do nothing and cause space leak. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #17982 Closes #17983	2025-12-10 10:02:14 -08:00
Alex	104da9657a	Fix a declaration position of the nth_page. Compilation time bug introduced by `87df5e4` commit. Fix for the compilation error(Linux kernel 6.18.0): "zfs/module/os/linux/zfs/abd_os.c:920:32: error: implicit declaration of function ‘nth_page’; did you mean ‘pte_page’? [-Werror=implicit-function-declaration]". Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: agiUnderground <alex.dev.cv@gmail.com> Closes #18034	2025-12-09 15:45:51 -08:00
Alexander Motin	a62c62120e	ARC: Pre-convert zfs_arc_min_prefetch_ms There is no need to do MSEC_TO_TICK() for each evicted ARC header. We can do it when tunables are set, since we already have separate internal variables for those. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17965	2025-12-09 12:07:10 -08:00
Alexander Motin	09492e0f21	Reduce dataset buffers re-dirtying For each block written or freed ZFS dirties ds_dbuf of the dataset. While dbuf_dirty() has a fast path for already dirty dbufs, it still requires taking the lock and doing some work that is visible in the profiler. Investigation showed that the ds_dbuf dirtying by dsl_dataset_block_born() and some of that by dsl_dataset_block_kill() is simply not needed, since by the time they are called in sync context the ds_dbuf is already dirtied by dsl_dataset_sync(). Tests show this reducing large file deletion time by ~3% by saving CPU time in the single-threaded part of the sync thread. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18028	2025-12-09 09:18:09 -08:00
bspengler-oss	060bc8b70d	Fix HIGHMEM/kmap API violation in zfs_uiomove_bvec_impl() Fix another instance where ZFS assumes multiple pages can be mapped at once via zfs_kmap_local(), resulting in crashes and potential memory corruption on HIGHMEM-enabled (typically 32-bit) systems. Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-09 09:12:24 -08:00
bspengler-oss	2cab0554c0	Preserve LIFO ordering of kmap ops in abd_raidz_gen_iterate() ZFS typically preserves proper LIFO ordering regarding map/unmap operations that wrap the Linux kernel's kmap interfaces that require such ordering, but one instance in abd_raidz_gen_iterate() did not. Similar issues have been fixed in the Linux kernel in the past, see for instance CVE-2025-39899 for userfaultfd. Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-09 09:12:16 -08:00
bspengler-oss	87df5e4872	Fix interaction of abd_iter_map()/abd_iter_unmap() with HIGHMEM HIGHMEM kmap interfaces operate on only a single page at a time yet ZFS hadn't accounted for this, resulting in crashes and potential memory corruption on HIGHMEM (typically 32-bit) systems. This was caught by PaX's KERNSEAL feature as it makes use of HIGHMEM functionality on x64. On typical 64-bit systems, this issue wouldn't have been observed, as the map interfaces simply fall back to returning an address in lowmem where the contiguous pages can be accessed directly. Joint work with the PaX Team, tested by Mark van Dijk Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-09 09:10:32 -08:00
Ameer Hamza	4ce030e025	Fix snapshot automount race causing duplicate mounts and AVL tree panic Multiple threads racing to automount the same snapshot can both spawn mount helper processes that successfully complete, causing both parent threads to attempt AVL tree registration and triggering a VERIFY() panic in avl_add(). This occurs because the fsconfig/fsmount API lacks the serialization provided by traditional mount() via lock_mount(). The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that serializes mount and unmount operations on the same snapshot. The first mount thread creates a pending entry with se_spa=NULL and holds se_mtx during the helper execution. Concurrent mounts find the pending entry and return success without spawning duplicate helpers. Unmount waits on se_mtx if a mount is pending, ensuring proper serialization. This allows different snapshots to mount in parallel while preventing the AVL panic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17943	2025-12-08 13:49:11 -08:00
Mark Johnston	86b064469d	FreeBSD: Fix a potential null dereference in zfs_freebsd_fsync() In general it's possible for a vnode to not have an associated VM object. This happens in particular with named pipes, which have some distinct VOPs, defined in zfs_fifoops. Thus, this chunk of zfs_freebsd_fsync() needs to check for the FIFO case, like other vm_object_mightbedirty() callers do. (Note that vn_flush_cached_data() calls are predicated on zn_has_cached_data() returning true, and it checks for a NULL v_object pointer already.) Fixes: `ef4058fcdc` Reported-by: Collin Funk <collin.funk1@gmail.com> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18015	2025-12-08 13:46:30 -08:00
Ameer Hamza	88d012a1d6	Fix snapshot automount expiry cancellation deadlock A deadlock occurs when snapshot expiry tasks are cancelled while holding locks. The snapshot expiry task (snapentry_expire) spawns an umount process and waits for it to complete. Concurrently, ARC memory pressure triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the expiry task while holding locks. The umount process spawned by the expiry task blocks trying to acquire locks held by arc_prune, which is blocked waiting for the expiry task to complete. This creates a circular dependency: expiry task waits for umount, umount waits for arc_prune, arc_prune waits for expiry task. Fix by adding non-blocking cancellation support to taskq_cancel_id(). The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to reschedule the unmount, which needs to cancel any existing expiry task. It now uses non-blocking cancellation to avoid waiting while holding locks, breaking the deadlock by returning immediately when the task is already running. The per-entry se_taskqid_lock has been removed, with all taskqid operations now protected by the global zfs_snapshot_lock held as WRITER. Additionally, an se_in_umount flag prevents recursive waits when zfsctl_destroy() is called during unmount. The taskqid is now only cleared by the caller on successful cancellation; running tasks clear their own taskqid upon completion. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17941	2025-12-01 14:43:42 -08:00
Alexander Motin	928eccc5bc	DDT: Reduce global DDT lock scope during writes Before this change the DDT lock was taken 4 times per written block, and as effectively a pool-wide lock it can be highly congested. This change introduces a new per-entry dde_io_lock, protecting some fields during I/O ready and done stages, so that we don't need the global lock there. According to my write tests on a 64-thread system with 4KB blocks, this significantly reduces the global lock contention, reducing CPU usage from 100% to the expected ~80% and increasing write throughput by 10%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17960	2025-12-01 10:44:10 -08:00
Alexander Motin	a5b665df39	DDT: Switch to using wmsums for lookup stats ddt_lookup() is a very busy code under a highly congested global lock. Anything we can save here is very important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17980	2025-12-01 10:36:31 -08:00
Alexander Motin	48f33c1ef2	DDT: Make children writes inherit allocator Even though unlike gang children it is not so critical for dedup children to inherit parent's allocator, there is still no reason for them to have allocation policy different from normal writes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17961	2025-12-01 10:30:27 -08:00
Rob Norris	c631f5e6c2	Linux: bump -std to gnu11 Linux switched from -std=gnu89 to -std=gnu11 in 5.18 (torvalds/linux@e8c07082a8). We've always overridden that with gnu99 because we use some newer features. More recent kernels are using C11 features in headers that we include. GCC generally doesn't seem to care, but more recent versions of Clang seem to be enforcing our gnu99 override more strictly, which breaks the build in some configurations. Just bumping our "override" to match the kernel seems to be the easiest workaround. It's an effective no-op since 5.18, while still allowing us to build on older kernels. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17954	2025-12-01 10:19:11 -08:00
Alexx Saver	39303febac	chksum: run 256K benchmark on demand, preserve chksum_stat_data Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexx Saver <lzsaver.eth@ethermail.io> Co-authored-by: Adam Moss <c@yotes.com> Closes #17945 Closes #17946	2025-12-01 10:14:52 -08:00
Ameer Hamza	36e4f18883	Fix taskq NULL pointer dereference on timer race Remove unsafe timer_pending() check in taskq_cancel_id() that created a race where: - Timer expires and timer_pending() returns FALSE - task_done() frees task with tqent_func = NULL - Timer callback executes and queues freed task - Worker thread crashes executing NULL function Always call timer_delete_sync() unconditionally to ensure timer callback completes before task is freed. Reliably reproducible by injecting mdelay(10) after setting CANCEL flag to widen the race window, combined with frequent task cancellations (e.g., snapshot automount expiry). Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17942	2025-11-19 08:21:10 -08:00
Brian Behlendorf	a49158c064	icp: remove global icp includes Only include the required icp headers. There's no need to include sys/zfs_context.h and pull in all of the zfs headers. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:03:51 -08:00
Brian Behlendorf	801d9b4f96	debug: move all of the debug bits out of the spl Pull all of the internal debug infrastructure up in to the zfs code to clean up the layering. Remove all the dodgy usage of SET_ERROR and DTRACE_PROBE from the spl. Luckily it was lightly used in the spl layer so we're not losing much. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:02:51 -08:00
Rob Norris	faa295b9a6	libspl: move SID definitions from zfs_context.h; remove kernel gate Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:01:48 -08:00
Mariusz Zaborski	02fdd26e51	Add knob to disable slow io notifications Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that allows users to disable notifications for slow devices. This prevents ZED and/or ZFSD from degrading the pool due to slow I/O. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes 17477	2025-11-11 10:42:17 -08:00
Alexander Motin	b4f073b5a6	Add BRT support to zpool prefetch command Implement BRT (Block Reference Table) prefetch functionality similar to existing DDT prefetch. This allows preloading BRT metadata into ARC to improve performance for block cloning operations and frees of earlier cloned blocks. Make -t parameter optional. When omitted, prefetch all supported metadata types (both DDT and BRT now). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17890	2025-11-10 16:16:22 -08:00
Alexander Motin	cc5cae5475	BRT: Increase block size from 4KB to 8KB According to my observations, BRT ZAPs are typically compressible 3:1 for data and 2:1 for indirects. With ashift=12, typical these days, it means increasing the block sizes to 8KB we may get most of possible compression, reducing on-disk and in-ARC BRT footprint in half by the cost of some compression/decompression overhead, but without real write inflation, only some dirty data increase. Increase to 32KB similar to DDT could further increase compression and storage efficiency, but at the cost of write inflation and much bigger dirty data increase, which we can not properly control now. So lets leave this for a time when BRT log gets implemented. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17916	2025-11-10 15:44:46 -08:00
Alexander Motin	72b2a9571a	ZAP: Remove dmu_object_info_from_dnode() call dmu_object_info_from_dnode() takes two locks and copies plenty of data that we don't need in zap_lockdir_impl(). Just read dn_type directly in this hot path. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17921	2025-11-10 14:26:15 -08:00
Rob Norris	6e12f0bd77	spa_misc: add an API for spa_namespace_lock This is useful as debugging support, as it lets namespace lock operations be traced directly. It will also be useful for future work to reduce the use of spa_namespace_lock, traditionally a source of difficult deadlocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17906	2025-11-10 14:23:39 -08:00
Alexander Motin	baefe098ee	ZIO: Set minimum number of free issue threads to 32 Free issue threads might block waiting for synchronous DDT, BRT or GANG header reads. So unlike other taskqs using ZTI_SCALE to scale with number of CPUs, here we also need some amount of threads to potentially saturate pool reads. I am not sure we always want the 96 threads we had before ZTI_SCALE introduction at #11966 on small systems, but lets make it at least 32. While here, make free taskqs configurable, similar to read and write ones. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17903	2025-11-08 14:41:53 -05:00
rmacklem	e26b9fc871	FreeBSD: Add support for _PC_CASE_INSENSITIVE FreeBSD now has a pathconf name called _PC_CASE_INSENSITIVE used to check if a file system performs case insensitive name lookups. This patch adds support for this name. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17908	2025-11-08 13:20:23 -05:00
Brian Behlendorf	962474d1a2	zstd: disable intrinsics Disable the aarch64 NEON SIMD intrinsics for kernel builds. Safely using them in the kernel context requires saving/restoring the FPU registers which is not currently done. Additionally, remove the aarch64 optimized PREFETCH_L1 and PREFETCH_L2 instruction. Rely on the more portable compiler built ins. This lets us remove the problematic workaround in the aarch64_compat.h header which undefines the __aarch64__ macro. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17904 Closes #17852	2025-11-07 10:01:12 -08:00
Adi-Goll	54876ee85e	Fix typo in vdev_raidz.c Change the spelling of "begining" on line 4875 to "beginning". Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com> Closes #17905	2025-11-07 09:55:03 -08:00
Tony Hutter	f93506d1df	Linux 6.17 compat: Fix broken projectquota on 6.17 We need to specifically use the FS_XFLAG_* macros in zpl_ioctl_*attr() codepaths, and the FS_*_FL macros in the zpl_ioctl_*flags() codepaths. The earlier code just assumes the FS_*_FL macros for both codepaths. The 6.17 kernel adds a bitmask check in copy_fsxattr_from_user() that exposed this error via failing 'projectquota' ZTS tests. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17884 Closes #17869	2025-11-05 16:22:03 -08:00
Paul Dagnelie	8c225ff1b4	Fix gang write late_arrival bug When a write comes in via dmu_sync_late_arrival, its txg is equal to the open TXG. If that write gangs, and we have not yet activated the new gang header feature, and the gang header we pick can store a larger gang header, we will try to schedule the upgrade for the open TXG + 1. In debug mode, this causes an assertion to trip. This PR sets the TXG for activating the feature to be the larger of either the current open TXG or the syncing TXG + 1. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17824	2025-11-05 11:40:22 -08:00
Robert Evans	d0294aa758	Update dnode_next_offset_level to accept blkid instead of offset Currently this function uses L0 offsets which: 1. is hard to read since it maps offsets to blkid and back each call 2. necessitates dnode_next_block to handle edge cases at limits 3. makes it hard to tell if the traversal can loop infinitely Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction. I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases? While here tidy up dnode_next_offset_level comments. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com> Closes #17792	2025-11-04 13:12:17 -08:00
Alexander Motin	6cfc3dba9c	Cleanup ZIO_FLAG_IO_RETRY vs TRYHARD usage In cases where all issued ZIOs must succeed, and we can't do anything clever about the errors, we should just explicitly set ZIO_FLAG_TRYHARD and let OS to do all the reasonable retries. In other cases, where retries can be different from the original, for example, some ZIOs are allowed to fail due to redundancy, or we can disable aggregation on retrial to get at least some of the data, we can do first pass without TRYHARD, and only if needed retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics). Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17877	2025-10-30 16:29:48 -07:00
Alexander Motin	ec268cdf97	Fix caching of DDT log and BRT Both DDT log and BRT counters we read on pool import and then only append or overwrite in full blocks. We don't need them in DMU or ARC caches. Fortunately we have DMU_UNCACHEDIO for this now. Even more we don't need BRT in non-evictable metadata DMU caches, since it will likely never fit there, while block the cache from its original users. Since DMU_OT_IS_METADATA_CACHED() has no way to differentiate the new metadata types, mark BRT with storage type of DMU_OT_DDT_ZAP. As side effect it will also put it on dedup device, but that should actually be right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17875	2025-10-30 16:28:28 -07:00
Alexander Motin	ea125eeb5d	BRT: Round bv_entcount up to BRT_BLOCKSIZE Since we set bv_mos_brtvdev block size, and since we keep dirty bitmap at the same granularity, we should keep the allocations and writes done with. Otherwise it makes the last block write short, that will be odd once we implement writing of only dirty blocks, but also requires read-modify-write on DMU layer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17875	2025-10-30 16:28:05 -07:00
Alexander Motin	dcada084b9	Pass flags to more DMU write/hold functions Over the time many of DMU functions got flags argument to control prefetch, caching, etc. Few functions though left without it, even though closer look shown that many of them do not require prefetch due to their access pattern. This patch adds the flags argument to dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where applicable. I am going to also pass DMU_UNCACHEDIO to some of them later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17872	2025-10-29 11:17:51 -07:00
Ryan Libby	0455150f11	FreeBSD zio_crypt.c: initialize uio variables before access In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was not initialized before the calls to zfs_uio_init, leading to uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc warnings. Similar issue as fixed in `2bf152021` ("Fix gcc uninitialized warning in FreeBSD zio_crypt.c") Signed-off-by: Ryan Libby <rlibby@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17863	2025-10-23 21:23:25 -04:00
Jean-Sébastien Pédron	3a55e76b84	FreeBSD: zfs_getpages: Don't zero freshly allocated pages Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if there is a concurrent `zfs_write()` running and fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it got the range lock, it filters out valid pages, then copy DMU data to the remaining invalid pages. The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content. This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands: git clone https://github.com/rabbitmq/rabbitmq-server.git cd rabbitmq-server/deps/rabbit gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \ ct-amqp_client t=cluster_size_3:leader_transfer_stream_send The testsuite fails because there is a sendfile(2) that can happen concurrently to a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash. The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is invalid, in case it would be partially filled with the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Jean-Sébastien Pédron <dumbbell@FreeBSD.org> Closes #17851	2025-10-20 17:04:21 -07:00
Rob Norris	fe8b50f09f	Linux 6.18: generic_drop_inode() and generic_delete_inode() renamed Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	3651888182	sha256_generic: make internal functions a little more private Linux 6.18 has conflicting prototypes for various sha256_* and sha512_* functions, which we get through a very long include chain. That's tough to fix right now; easier is just to rename our internal functions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	8911360a41	Linux 6.18: namespace type moved to ns_common The namespace type has moved from the namespace ops struct to the "common" base namespace struct. Detect this and define a macro that does the right thing for both versions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	76c238f1ba	Linux 6.18: replace write_cache_pages() Linux 6.18 removed write_cache_pages() without a usable replacement. Here we implement a minimal zpl_write_cache_pages() that finds the dirty pages within the mapping, gets them into the expected state and hands them off to zfs_putpage(), which handles the rest. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	39db4bda80	Linux 6.18: block_device_operations->getgeo takes struct gendisk* Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	5de4a297e7	Linux 6.18: convert ida_simple_* calls ida_simple_get() and ida_simple_remove() are removed in 6.18. However, since 4.19 they have been simple wrappers around ida_alloc() and ida_free(), so we can just use those directly. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	9d50ee59dc	Linux 6.18: replace nth_page() Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Andrew Walker	adacf020ce	Fix return value for setting zvol threading We must return -1 instead of ENOENT if the special zvol threading property set function can't locate the dataset (this would typically happen with an encrypted and unmounted zvol) so that the operation gets inserted properly into the nvlist for operations to set. This is because we want the property to be set once the zvol is decrypted again. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #17836	2025-10-20 15:21:40 -07:00
Andrew Walker	783a02b5d3	Fix ZFS_READONLY implementation on Linux MS-FSCC 2.6 is the governing document for DOS attribute behavior. It specifies the following: For a file, applications can read the file but cannot write to it or delete it. For a directory, applications cannot delete it, but applications can create and delete files from the directory. Signed-off-by: Andrew Walker <awalker@ixsystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Closes #17837	2025-10-20 09:28:57 -04:00
Brian Behlendorf	5a03e358fc	Update device removal documentation Make a minor update to the 'zpool remove' man page to clarify both raidz and draid pools do not support removal, and change sector to ashift which is what we actually care about. Update the big theory comment in vdev_removal.c to accurately reflect which types of vdevs can be removed. Furthermore, I've added some discussion for the casual reader to briefly explain the top-level vdev removal restrictions. This has been a common area of confusion and it's not intuitive where they come from without understanding the implementation details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17847	2025-10-20 09:26:51 -04:00
Shreshth3	a5af3f2db7	arc: fix small typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com> Closes #17840	2025-10-13 11:23:55 -07:00
Mark Johnston	a9f2a1f361	Fix the type of the raidz_outlier_check_interval_ms parameter It's an hrtime_t, which is an unsigned long long. In practice this is just a U64. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #17833	2025-10-13 10:47:09 -07:00
Dag-Erling Smørgrav	6e5b836e9f	FreeBSD: Correct _PC_MIN_HOLE_SIZE The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway. * In the general case, if the vnode is a regular file, return its current block size, or the record size if the file is smaller than its own block size. If the vnode is a directory, return the dataset record size. If it is neither a regular file nor a directory, return EINVAL. * In the control directory case, always return EINVAL. Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17750	2025-10-08 09:13:22 -04:00

5206 Commits