Commit Graph

5148 Commits

Author SHA1 Message Date
Mariusz Zaborski
a157ef62a1
Make sure we can still write data to txg
The final txgs are used only to clear out any remaining deferred
frees, and we cannot write new data to them. Make sure we do not
try to do so.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #18139
2026-01-26 21:33:21 -05:00
Alexander Motin
35b2d39709
Lock db_mtx around arc_release() in a couple of places
* Lock db_mtx around arc_release() in dbuf_release_bp()

While this function is called only in sync context, the same buffer
can be touched by dbuf_hold_impl() in open context, creating races.
All other accesses to arc_release() are already protected by db_mtx,
so just take it here too.

Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>

* Lock db_mtx in sa_byteswap()

While SA code seems protected by sa_lock, there is a back door via
dmu_objset_userquota_get_ids(), which may hold and access the dbuf
without sa_lock, relying only on db_mtx. Taking db_mtx here should
protect both the arc_release() and the data for db_buf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18146
2026-01-26 21:32:16 -05:00
Alek P
cd895f0e57
remove thread unsafe debug code causing FreeBSD double free panic
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes #18140
2026-01-21 10:00:34 -08:00
Alexander Moch
28291536bc Zstd: Document update policy
Add the Zstd update policy to the subtree README.

Also update the documented location of zstd-in.c to match upstream
changes, and normalize naming from 'ZSTD' to 'Zstd'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Moch <mail@alexmoch.com>
Closes #18089
2026-01-20 13:41:24 -08:00
Alexander Moch
2d5a9b6a4c Zstd: Restore SPDX license identifiers
When updating Zstandard to version 1.5.7 the SPDX license identifiers
were lost. This commit restores them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Moch <mail@alexmoch.com>
Closes #18089
2026-01-20 13:41:18 -08:00
Alexander Moch
e7f9734bc7 Zstd: Fix ASan poisoning for pooled Zstd contexts
The Zstd context mempool can reuse buffers that were previously poisoned
under AddressSanitizer, leading to false-positive use-after-poison reports
during zloop and other stress tests.

Explicitly unpoison memory when handing buffers out to Zstd and poison the
user-visible region again when buffers are returned to the pool. This makes
the allocator ASan-correct while preserving existing pooling behavior.

Also fix non-standard void * pointer arithmetic in zstd_free() and remove an
early return in zstd_dctx_alloc() so kmem_type/kmem_size are always set on
pool hits.

This only affects ASan bookkeeping in user space, does not change runtime
behavior in non-ASan configurations, and does not affect on-disk formats.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Moch <mail@alexmoch.com>
Closes #18089
2026-01-20 13:41:12 -08:00
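
The `void *` pointer arithmetic mentioned in e7f9734bc7 above can be illustrated with a minimal sketch. Arithmetic on `void *` is a GNU extension rather than ISO C, so the portable form casts to `char *` first. The header layout and the names `pool_hdr_t`, `hdr_to_user_ptr()` and `user_ptr_to_hdr()` are hypothetical and are not taken from zstd_free().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocation header placed in front of the user memory. */
typedef struct pool_hdr {
	size_t ph_size;		/* total allocation size */
	int ph_type;		/* how the buffer was allocated */
} pool_hdr_t;

/* Return the user-visible pointer that follows the header. */
static void *
hdr_to_user_ptr(pool_hdr_t *hdr)
{
	assert(hdr != NULL);
	return ((char *)hdr + sizeof (pool_hdr_t));
}

/*
 * Step back from a user pointer to its header. Subtracting directly
 * from a void * only compiles as a GNU extension; the char * cast
 * makes the byte arithmetic standard C.
 */
static pool_hdr_t *
user_ptr_to_hdr(void *ptr)
{
	assert(ptr != NULL);
	return ((pool_hdr_t *)((char *)ptr - sizeof (pool_hdr_t)));
}
```

A free routine written this way would call `user_ptr_to_hdr()` on the pointer it is given and read the size and type from the returned header.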
Alexander Moch
a2ac9cd606 Zstd: Integrate v1.5.7 into the ZFS build system
This commit builds on the previous zstd library update and adds the
necessary ZFS integration and build system changes required to make
zstd 1.5.7 compile and function correctly.

Changes:
- Add zstd_preSplit.c (new in 1.5.7) to all build systems.
- Enable x86_64 assembly in userspace (huf_decompress_amd64.S).
- Disable assembly in kernel for RETHUNK/IBT compatibility.
- Disable intrinsics in kernel for EL10 x86_64-v3 baseline.
- Disable tracing in kernel builds for AArch64 compatibility.
- Fix ZSTD_isError symbol renaming with __asm__ directive.
- Rename abs64 to ZSTD_abs64 (FreeBSD kernel conflict).
- Fix bitstream.h attributes (MEM_STATIC -> FORCE_INLINE_TEMPLATE).
- Remove xxhash.c from BSD build (now header-only).
- Update symbol names in zstd_compat_wrapper.h.
- Ignore checkstyle for zstd-in.c.

Kernel assembly disabled for security mitigation compatibility. User
space retains full performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Moch <mail@alexmoch.com>
Closes #18089
2026-01-20 13:41:06 -08:00
Alexander Moch
bbcddb127a Zstd: Update bundled library to v1.5.7 without further adjustments
This commit only replaces the bundled source and does not include any
ZFS integration changes. Because the build depends on integration
adjustments, it will fail until the accompanying integration commit is
applied.

Upstream release: https://github.com/facebook/zstd/releases/tag/v1.5.7

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Moch <mail@alexmoch.com>
Closes #18089
2026-01-20 13:40:37 -08:00
Mark Johnston
54b141fab5
FreeBSD: Remove references to DEBUG_VFS_LOCKS
This option is removed upstream in favour of plain INVARIANTS.

VNASSERT is always defined so I see no reason to use it conditionally.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #18136
2026-01-19 08:55:17 -08:00
Martin Matuška
8605bdfdda
FreeBSD: unbreak compilation on i386
tests/zfs-tests/cmd/mmap_seek.c: use correct printf specifier
module/zfs/vdev.c: vdev_clear(): correctly cast argument to
atomic_add_64().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #18096
2026-01-14 17:02:41 -08:00
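
The printf half of 8605bdfdda above is a common 32-bit portability problem. The sketch below shows one width-independent way to format an `off_t` and a `uint64_t`; it is an illustration only, and the helper names are hypothetical. Which specifier the commit actually chose is not stated in the message.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an off_t without assuming whether it is 32 or 64 bits wide. */
static int
format_offset(char *buf, size_t len, off_t off)
{
	return (snprintf(buf, len, "%jd", (intmax_t)off));
}

/* Format a uint64_t; "%lu" would be wrong where long is 32 bits. */
static int
format_u64(char *buf, size_t len, uint64_t val)
{
	return (snprintf(buf, len, "%" PRIu64, val));
}
```

Both helpers return the number of characters written, as snprintf() does.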
Alan Somers
3fffe4e707
Fix --enable-invariants on FreeBSD
The make symbols were never getting forwarded to the correct make
subprocess.  As far as I can tell, this has never worked.  Either that,
or something has changed in the behavior of make.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #18131
2026-01-14 14:54:12 -08:00
shuppy
09e4e01e93
Fix history logging for zpool create -t
`zpool create` is supposed to log the command to the new pool’s history,
as a special record that never gets evicted from the ring buffer. but
when you create a pool with `zpool create -t`, no such record is ever
logged (#18102). that bug may be the cause of issues like #16408.

`zpool create -t` (83e9986f6e) and `zpool import -t` (26b42f3f9d) are
both designed to override the on-disk zpool property `name` with an
in-core “temporary” name, but they work somewhat differently under the
hood.

importing with a temporary name sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells
spa_write_cachefile() and spa_config_generate() to use the
ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`.

creating with a temporary name permanently(!) sets the internal zpool
property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of
ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create()
(4ceb8dd6fd) and spa_create() to use that
name instead of `zc->zc_name`, then sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` like an import.

but zfsdev_ioctl_common() fails to check for `tname` when saving the
pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY,
we call spa_open() on the wrong pool name and get ENOENT, so the logging
silently fails.

this patch fixes #18102 by checking for `tname` in zfsdev_ioctl_common()
like we do in zfs_ioc_pool_create().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes #18118  
Closes #18102
2026-01-14 14:51:51 -08:00
Alexander Motin
765929cb4e
DDT: Add locking for table ZAP destruction
Similar to BRT, the DDT ZAP can be destroyed by sync context when it
becomes empty.  As was done for BRT, introduce an RW-lock to protect
open context methods from the destruction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18115
2026-01-13 15:07:15 -08:00
Andrew Walker
aca58dbb65
Add fh_to_parent export definition
This commit adds support for converting a file handle to its
parent dentry. This is called in exportfs_decode_fh_raw()
when subtree checking is enabled in NFS. Defining this and
handling the expanded filehandles allows the knfsd to succeed
in handling the file handle where it might otherwise fail
with ESTALE when trying to open by filehandle.

A side effect of this change is that name_to_handle_at(2)
and open_by_handle_at(2) now support AT_HANDLE_CONNECTABLE.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Andrew Walker <andrew.walker@truenas.com>
Closes #18099
2026-01-08 15:06:12 -08:00
Rob Norris
f2b4ed3fe5 spl: remove a _KERNEL check
This code is only compiled for the Linux kernel module, so that define
is always set.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18117
2026-01-08 10:33:44 -08:00
Rob Norris
02a631139f spl: unexport kstat_proc_entry functions
These are used to implement the kstat and procfs_list interfaces, and
aren't used from outside. There's no need to export them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18117
2026-01-08 10:33:37 -08:00
Rob Norris
662f33f323 spl: lift 64-bit math compat out to separate file
It's a lot of rarely-compiled code, so move it to the side to make other
code easier to read.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18117
2026-01-08 10:33:32 -08:00
Rob Norris
2ca6e880da spl: remove old atomic lock
Long ago, SPL atomics were implemented as a global spinlock over
conventional operations. In 5e9b5d832b (2009-10) they were converted to
proper atomics, with the spinlock retained as a fallback.

The switch to compile with the fallback was later removed in a91258913f
(2018-05), but the code it enabled wasn't. So let's do that.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18117
2026-01-08 10:33:14 -08:00
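
The two styles described in 2ca6e880da above can be contrasted with a small userspace sketch using C11 atomics. This is an assumption-laden illustration, not the SPL code: the function names are hypothetical and a C11 `atomic_flag` stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Old fallback style: one global spinlock guards every "atomic" update. */
static atomic_flag global_atomic_lock = ATOMIC_FLAG_INIT;

static uint64_t
locked_add_64_nv(uint64_t *target, uint64_t delta)
{
	uint64_t nv;

	while (atomic_flag_test_and_set_explicit(&global_atomic_lock,
	    memory_order_acquire))
		;	/* spin until the lock is free */
	nv = (*target += delta);
	atomic_flag_clear_explicit(&global_atomic_lock, memory_order_release);
	return (nv);
}

/* Proper atomic: only accesses to this one variable are serialized. */
static uint64_t
native_add_64_nv(_Atomic uint64_t *target, uint64_t delta)
{
	return (atomic_fetch_add(target, delta) + delta);
}
```

Both functions return the new value. The locked variant makes unrelated counters contend on the same lock, which is why it was only ever kept as a fallback.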
Dimitry Andric
2f1f25217f
icp: emit .note.GNU-stack section for all ELF targets
On FreeBSD, linking the zfs kernel module with binutils ld 2.44 shows
the following warning:

    ld: warning: aesni-gcm-avx2-vaes.o: missing .note.GNU-stack section
    implies executable stack
    ld: NOTE: This behaviour is deprecated and will be removed in a
    future version of the linker

Some of the `.S` files under `module/icp/asm-x86_64/modes` check whether
to emit the `.note.GNU-stack` section using:

    #if defined(__linux__) && defined(__ELF__)

We could add `&& defined(__FreeBSD__)` to the test, but since all other
`.S` files in the OpenZFS tree use:

    #ifdef __ELF__

it would seem more logical to use that instead. Any recent ELF platform
should support these note sections by now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #18119
2026-01-08 09:21:12 -08:00
Austin Wise
794f1587db
When receiving a stream with the large block flag, activate feature
ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS
to indicate the presence of large blocks in the dataset. On the sending
side, this flag is included if the `-L` flag is passed to `zfs send`
and the feature is active in the dataset. On the receive side, the
stream is refused if the feature is active in the destination dataset
but the stream does not include the feature flag.

The problem is the feature is only activated when a large block is
born. If a large block has been born in the destination, but never
the source, the send can't work. This can arise when sending streams
back and forth between two datasets.

This commit fixes the problem by always activating the large blocks
feature when receiving a stream with the large block feature flag.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes #18105
2026-01-07 16:47:12 -08:00
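
The send/receive rules described in 794f1587db above can be modelled in a few lines. This is a simplified sketch under stated assumptions: `dataset_sketch_t`, the three functions and the bit value of the flag are hypothetical, and only the rules quoted in the commit message are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value standing in for DMU_BACKUP_FEATURE_LARGE_BLOCKS. */
#define	SKETCH_FEATURE_LARGE_BLOCKS	(UINT64_C(1) << 0)

typedef struct dataset_sketch {
	bool ds_large_blocks_active;
} dataset_sketch_t;

/* Sender: the flag is set only with -L and an active feature. */
static uint64_t
send_stream_flags(const dataset_sketch_t *src, bool large_block_opt)
{
	if (large_block_opt && src->ds_large_blocks_active)
		return (SKETCH_FEATURE_LARGE_BLOCKS);
	return (0);
}

/* Receiver: refuse a flagless stream when the feature is already active. */
static bool
recv_stream_allowed(const dataset_sketch_t *dst, uint64_t flags)
{
	return (!dst->ds_large_blocks_active ||
	    (flags & SKETCH_FEATURE_LARGE_BLOCKS) != 0);
}

/* Receiver after the fix: a flagged stream always activates the feature. */
static void
recv_stream_begin(dataset_sketch_t *dst, uint64_t flags)
{
	if (flags & SKETCH_FEATURE_LARGE_BLOCKS)
		dst->ds_large_blocks_active = true;
}
```

In this model, once a dataset has received a flagged stream it can send back with `-L` to a dataset where the feature is active, because its own feature is now active too. Without the activation step the return stream carries no flag and is refused.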
Jitendra Patidar
2301755dfb
Fix zfs_open() to skip zil_async_to_sync() for the snapshot
Fix zfs_open() to skip zil_async_to_sync() for the snapshot, as it won't
have any transactions. zfsvfs->z_log is NULL for the snapshot.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #18091
2026-01-06 10:58:56 -08:00
Wolfgang Hoschek
c77f17b750
Add snapshots_changed_nsecs dataset property
Add a read-only dataset property, snapshots_changed_nsecs, which 
exposes the nanosecond resolution version of snapshots_changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Wolfgang Hoschek <wolfgang.hoschek@mac.com>
Closes #17998
Closes #18031
2026-01-06 09:36:20 -08:00
Andrew Walker
312bdab0f5
Add handling for STATX_CHANGE_COOKIE
This commit adds handling for the STATX_CHANGE_COOKIE so that
we can properly surface the ZFS znode sequence to NFS clients via
knfsd.

If knfsd does not have STATX_CHANGE_COOKIE in statx result then
it will synthesize the NFS change_info4 structure and related
change4id values algorithmically based on the ctime value of the
file. Since internally ZFS is using ktime_get_coarse_real_ts64()
for the timestamp calculation here it introduces the possibility
that the change will not increment the change4id of directories
/ files causing a failure in the client to invalidate its attr
cache (among other things). See RFC 8881 Section 10.8 for
discussion of how clients may implement name and directory
caching.

Notable in this commit is that we are not initializing the
inode->i_version to the znode->z_seq number. The reason for this
is that we're intentionally not setting `SB_I_VERSION`. This
indicates that the filesystem manages its own i_version and
so it is not populated in the generic_fillattr.

The following compares a tight loop of setattr over the NFSv4
protocol while tracing nfsd4_change_attribute.

Before change:
inode, change_attribute
4723, 7590032215978780890
4723, 7590032215978780890
4723, 7590032215978780890
4723, 7590032215982780865
4723, 7590032215982780865

After change:
inode, change_attribute
7602, 7590032992517123951
7602, 7590032992517123952
7602, 7590032992517123953
7602, 7590032992517123954
7602, 7590032992517123955

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Andrew Walker <andrew.walker@truenas.com>
Closes #18097
2026-01-05 14:06:28 -08:00
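
The difference between a change id derived from a coarse ctime and one derived from a per-file sequence number, as described in 312bdab0f5 above, can be sketched as follows. The structure and function names are hypothetical, and the clock granularity is passed in as a parameter purely for illustration.

```c
#include <stdint.h>

typedef struct node_sketch {
	uint64_t ns_ctime;	/* coarse change time, in nanoseconds */
	uint64_t ns_seq;	/* per-node sequence, bumped on every change */
} node_sketch_t;

/* Record a change at time `now`, with the clock rounded down to `tick`. */
static void
node_change(node_sketch_t *n, uint64_t now, uint64_t tick)
{
	n->ns_ctime = now - (now % tick);
	n->ns_seq++;
}

/* Change id synthesized from ctime: repeats within one clock tick. */
static uint64_t
change_id_from_ctime(const node_sketch_t *n)
{
	return (n->ns_ctime);
}

/* Change id taken from the sequence: differs for every change. */
static uint64_t
change_id_from_seq(const node_sketch_t *n)
{
	return (n->ns_seq);
}
```

Two changes that land inside the same coarse tick produce the same ctime-based id but different sequence-based ids, which matches the shape of the before/after output in the commit message.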
Rob Norris
a1319bf654
kmem: don't add __GFP_RECLAIMABLE for KM_VMEM allocations
vmalloc()'d memory is not movable/reclaimable, so __GFP_RECLAIMABLE is
not a valid flag, and since 6.19 the kernel warns if you use it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18107
2026-01-05 13:35:13 -08:00
Rob Norris
f041375b52 kmem: don't add __GFP_COMP for KM_VMEM allocations
It hasn't been necessary since Linux 3.13
(torvalds/linux@a57a49887e), and since 6.19 the kernel warns if you
use it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18053
2025-12-23 12:54:34 -08:00
Rob Norris
f95e306266 kmem: don't pass __GFP_HIGHMEM to __vmalloc
Since Linux 4.12 (torvalds/linux@19809c2da2) __GFP_HIGHMEM has been
automatically added to calls to __vmalloc() internally, so we don't need
it anymore. This is good, because since 6.19 the kernel warns if you use
__GFP_HIGHMEM.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18053
2025-12-23 12:54:11 -08:00
Rob Norris
3c8665cb5d Linux 6.19: replace i_state access with inode_state_read_once()
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18053
2025-12-23 12:53:32 -08:00
Ivan Shapovalov
9880ac3080 zvol: cosmetic: fix up volthreading property short name
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
2025-12-23 11:12:21 -08:00
Rob Norris
654e7628d6 u8_textprep: move into module/zfs
Now that it's built into the main zfs module in all cases, there's no
reason to put it in its own dir.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18071
2025-12-22 14:58:36 -08:00
Alexander Motin
962e68865e
Use reduced precision for scan times
Scan time limits do not need precision beyond 1ms.  Switching
scn_sync_start_time and spa_sync_starttime from gethrtime() to
getlrtime() saves ~3% of CPU time during resilver scan stage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18061
2025-12-18 10:22:11 -08:00
Alexander Motin
a83bb15fcd
Reduce minimal scrub/resilver times
With the higher throughput and lower latency of modern devices, ZFS
can happily live with pretty short (fractions of a second) TXGs.  But
the two-decade-old multi-second minimal time limits can almost stop
payload writes by extending TXGs beyond dirty data limits of ARC
ability to amortize it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18060
2025-12-18 10:21:45 -08:00
Mark Maybee
7ff329ac2e
Fix rangelock test for growing block size
If the file already has more than one block, then the current
block size cannot change. But if the file block size is less
than the maximum block size supported by the file system, and
there are multiple blocks in the file, the current code will
almost always extend the rangelock to its maximum size.
This means that all writes become serialized and even reads
are slowed as they will more often contend with writes. This
commit adjusts the test so that we will not lock the entire
range if there is more than one block in the file already.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Maybee <mark.maybee@perforce.com>
Closes #18046
Closes #18064
2025-12-18 09:23:38 -08:00
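
The adjusted test in 7ff329ac2e above can be approximated by a small predicate. This is a sketch only: the function name and parameters are hypothetical, and the single-block condition (write end beyond the current block, block size below the maximum) is an assumption about the surrounding logic rather than the actual rangelock code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a write must lock the whole file because the block
 * size may have to grow.
 */
static bool
write_must_lock_whole_file(uint64_t end_size, uint64_t blksz,
    uint64_t max_blksz, uint64_t nblocks)
{
	/* With more than one block the block size can no longer change. */
	if (nblocks > 1)
		return (false);
	/* A single-block file may still need its block to grow. */
	return (end_size > blksz && blksz < max_blksz);
}
```

With the early return in place, writes to a multi-block file keep their narrow range and no longer serialize against each other.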
Alexander Motin
051a8c7494
Bypass snprintf() in quota checks if no quotas set
This improves synthetic 1 byte write speed by ~2.5%.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18063
2025-12-17 21:59:47 -05:00
Alexander Motin
0550abd4b8
RAIDZ: Remove some excessive logging
There was some per-I/O logging into dbgmsg in the RAIDZ code that
increased CPU load and wiped useful content out of dbgmsg, for
example during a routine disk replacement.  I don't think we need it
to be that verbose.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18059
2025-12-17 14:00:01 -08:00
Alexander Motin
22e89aca88
DDT: Fix compressed entry buffer size
The first byte of the entry after compression is used for algorithm
and byte order flag.  We should decrement the buffer size by that byte
when calling the compression/decompression algorithm.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18055
2025-12-15 14:52:44 -08:00
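
The buffer-size issue in 22e89aca88 above is an off-by-one on a buffer with a one-byte header. A minimal sketch, assuming a hypothetical layout and a stand-in "compressor" that simply copies its input (`toy_compress()`, `pack_entry()` and the flag value are not from the DDT code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	ENTRY_FLAG_STORED	0x01	/* hypothetical header flag value */

/* Stand-in compressor: copies src if it fits; returns bytes used, 0 on failure. */
static size_t
toy_compress(const uint8_t *src, uint8_t *dst, size_t s_len, size_t d_len)
{
	if (s_len == 0 || s_len > d_len)
		return (0);
	memcpy(dst, src, s_len);
	return (s_len);
}

/*
 * Pack an entry into dst: byte 0 holds the header, the payload follows.
 * The compressor is given d_len - 1, not d_len, so it can never write
 * past the end of dst.
 */
static size_t
pack_entry(const uint8_t *src, size_t s_len, uint8_t *dst, size_t d_len)
{
	size_t c_len;

	assert(d_len >= 1);
	dst[0] = ENTRY_FLAG_STORED;
	c_len = toy_compress(src, dst + 1, s_len, d_len - 1);
	return (c_len == 0 ? 0 : c_len + 1);
}
```

Passing the full `d_len` to the compressor would let an 8-byte payload be written into an 8-byte buffer that only has 7 bytes left after the header.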
Alexander Motin
3b1ff816bd
DDT: Add/use zap_lookup_length_uint64_by_dnode()
Unlike other ZAP consumers, DDT does not know, due to compression,
how big an entry it is reading from ZAP.  Because of this it called
zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(),
each of which does a full ZAP entry lookup.

Introducing the combined ZAP method dramatically reduces the
CPU overhead and lock contention at the DBUF layer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18048
2025-12-15 14:38:34 -08:00
Alexander Motin
ff5414406f
DDT: Switch to using ZAP _by_dnode() interfaces
As was previously done for BRT, avoid holding/releasing DDT ZAP
dnodes for every access.  Instead, hold the dnodes for their whole
lifetime, never releasing them.

While at it, add _by_dnode() interfaces for zap_length_uint64()
and zap_count(), actively used by DDT code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18047
2025-12-15 09:49:14 -08:00
Alexander Motin
46d6f1fe56
DDT: Move logs searches out of the lock
Postponing entry removal from the DDT log on a hit until the later
single-threaded sync stage keeps ddl_tree stable during the
multi-threaded ZIO processing stage.  That allows dropping the DDT
lock before the search instead of after, reducing the contention a lot.

Actually ddt_log_update_entry() was already handling the case of an
entry present in the active log, so we only need to remove it from
the flushing log, if the entry happens to be there.

My tests with parallel 4KB block writes show throughput increase
from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even
though still limited by the global DDT lock contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18044
2025-12-15 09:17:04 -08:00
Alexander Motin
3d76ba2737
Improve async destroy processing timing
Previous code effectively enforced that all async free ZIOs were
_issued_ within the TXG timeout.  But they could take forever to
complete, especially if the required metadata were not in ARC.

This patch introduces periodic waits every 2000 ZIOs, which should
give at least somewhat reasonable TXG timings even for single HDD
pools with empty ARC.  And makes them complete within half of the
TXG timeout, since we might still need time to sync DDT and BRT.

While there, change zfs_max_async_dedup_frees semantics to also
include clone and gang blocks, which are similar.  Bump the default
value, set long ago, to be more forgiving to block cloning
(still not having logs and benefiting from large TXGs), now that
we have better working time limits.  The limit now is a possible
amount of dirty data produced by BRT updates.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18043
2025-12-11 18:46:08 -08:00
Alexander Motin
f72fd378c8 Defer async destroys on pool import
We've observed a number of cases when pool import got stuck for many
minutes due to a large async destroy trying to load DDT or BRT from
an HDD pool.  While proper destroy dosage is a separate problem,
let's give the import process a chance to complete before that at all.
It may not be enough if there is a lot of ZIL to replay, but that
is harder to cover, since those are in separate syscalls.

Code investigation showed that we already have this mechanism used
for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS
into a tunable and applies it to async destroys also.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18033
2025-12-11 18:44:46 -08:00
Alexander Motin
d393166c54
ARC: Increase parallel eviction batching
Before the parallel eviction implementation, zfs_arc_evict_batch_limit
caused loop exits after evicting 10 headers.  The cost of that is not
big and is well motivated.  Now, though, a taskq task exit after the
same 10 headers is much more expensive.  To cover the context switch
overhead of taskq, introduce another level of batching, controlled
by the zfs_arc_evict_batches_limit tunable, used only for parallel
eviction.

My tests with 36 parallel reads with 4KB recordsize, which showed
1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention,
now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17970
2025-12-10 13:03:01 -08:00
Rob Norris
9fdb854109
Linux: work around use of GPL-only symbol kasan_flag_enabled
We may not be able to avoid our code referencing the symbol, but we can
ensure that a symbol of that name is available to the linker during
build, and so not require linking the GPL-exported version.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18009
Closes #18040
2025-12-10 10:04:57 -08:00
Chunwei Chen
0c194352b5
Fix ddtprune causing space leak
In zio_ddt_free(), if a pruned dde is still in the DDT, it would do
nothing and cause a space leak.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17982
Closes #17983
2025-12-10 10:02:14 -08:00
Alex
104da9657a
Fix the declaration position of nth_page.
Compile-time bug introduced by commit 87df5e4.
Fix for the compilation error (Linux kernel 6.18.0):
"zfs/module/os/linux/zfs/abd_os.c:920:32: error: implicit declaration
of function ‘nth_page’; did you mean ‘pte_page’?
[-Werror=implicit-function-declaration]".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: agiUnderground <alex.dev.cv@gmail.com>
Closes #18034
2025-12-09 15:45:51 -08:00
Alexander Motin
a62c62120e
ARC: Pre-convert zfs_arc_min_prefetch_ms
There is no need to do MSEC_TO_TICK() for each evicted ARC header.
We can do it when tunables are set, since we already have separate
internal variables for those.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17965
2025-12-09 12:07:10 -08:00
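
The idea in a62c62120e above, converting a tunable once when it is set instead of on every use, can be sketched like this. The tick rate, the macro and all variable and function names below are hypothetical stand-ins, not the ARC's actual symbols.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_HZ	100	/* hypothetical tick rate */
#define	SKETCH_MSEC_TO_TICK(ms)	((uint64_t)(ms) * SKETCH_HZ / 1000)

static uint64_t tunable_min_prefetch_ms;	/* user-facing value */
static uint64_t cached_min_prefetch_ticks;	/* internal, pre-converted */

/* Runs rarely: do the unit conversion here, once per tunable change. */
static void
set_min_prefetch_ms(uint64_t ms)
{
	tunable_min_prefetch_ms = ms;
	cached_min_prefetch_ticks = SKETCH_MSEC_TO_TICK(ms);
}

/* Runs per header: only a subtraction and a compare remain. */
static bool
header_old_enough(uint64_t now_ticks, uint64_t access_ticks)
{
	return (now_ticks - access_ticks >= cached_min_prefetch_ticks);
}
```

The per-header check no longer multiplies or divides, which matters when it runs once per evicted header.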
Alexander Motin
09492e0f21
Reduce dataset buffers re-dirtying
For each block written or freed ZFS dirties ds_dbuf of the dataset.
While dbuf_dirty() has a fast path for already dirty dbufs, it still
requires taking the lock and doing some things visible in profiler.

Investigation showed ds_dbuf dirtying by dsl_dataset_block_born()
and some of dsl_dataset_block_kill() are just not needed, since
by the time they are called in sync context the ds_dbuf is already
dirtied by dsl_dataset_sync().

Tests show this reducing large file deletion time by ~3% by saving
CPU time of single-threaded part of the sync thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18028
2025-12-09 09:18:09 -08:00
bspengler-oss
060bc8b70d Fix HIGHMEM/kmap API violation in zfs_uiomove_bvec_impl()
Fix another instance where ZFS assumes multiple pages can be
mapped at once via zfs_kmap_local(), resulting in crashes and
potential memory corruption on HIGHMEM-enabled (typically 32-bit)
systems.

Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes #15668
Closes #18030
2025-12-09 09:12:24 -08:00
bspengler-oss
2cab0554c0 Preserve LIFO ordering of kmap ops in abd_raidz_gen_iterate()
ZFS typically preserves proper LIFO ordering regarding map/unmap
operations that wrap the Linux kernel's kmap interfaces that
require such ordering, but one instance in abd_raidz_gen_iterate()
did not.

Similar issues have been fixed in the Linux kernel in the past,
see for instance CVE-2025-39899 for userfaultfd.

Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes #15668
Closes #18030
2025-12-09 09:12:16 -08:00
bspengler-oss
87df5e4872 Fix interaction of abd_iter_map()/abd_iter_unmap() with HIGHMEM
HIGHMEM kmap interfaces operate on only a single page at a time
yet ZFS hadn't accounted for this, resulting in crashes and
potential memory corruption on HIGHMEM (typically 32-bit) systems.
This was caught by PaX's KERNSEAL feature as it makes use of
HIGHMEM functionality on x64.

On typical 64-bit systems, this issue wouldn't have been observed,
as the map interfaces simply fall back to returning an address in
lowmem where the contiguous pages can be accessed directly.

Joint work with the PaX Team, tested by Mark van Dijk

Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes #15668
Closes #18030
2025-12-09 09:10:32 -08:00
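
The one-page-at-a-time constraint described in 87df5e4872 above can be illustrated with a userspace sketch that copies out of a buffer made of separately allocated pages. The page size, `copy_from_pages()` and the page array are hypothetical; a kernel implementation would map and unmap each page where the comment indicates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	SKETCH_PAGE_SIZE	16	/* tiny page size to keep the example small */

/*
 * Copy len bytes starting at byte offset off of a buffer made of
 * separate pages. Each iteration touches exactly one page, so no
 * access ever assumes that neighbouring pages are contiguous.
 */
static void
copy_from_pages(char *dst, char *const *pages, size_t off, size_t len)
{
	assert(dst != NULL && pages != NULL);
	while (len > 0) {
		size_t idx = off / SKETCH_PAGE_SIZE;
		size_t pgoff = off % SKETCH_PAGE_SIZE;
		size_t n = SKETCH_PAGE_SIZE - pgoff;

		if (n > len)
			n = len;
		/* A kernel would map pages[idx] here and unmap it after. */
		memcpy(dst, pages[idx] + pgoff, n);
		dst += n;
		off += n;
		len -= n;
	}
}
```

A copy that starts 4 bytes before a page boundary is split into two chunks, one per page, instead of reading past the end of the first mapping.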
Ameer Hamza
4ce030e025
Fix snapshot automount race causing duplicate mounts and AVL tree panic
Multiple threads racing to automount the same snapshot can both spawn
mount helper processes that successfully complete, causing both parent
threads to attempt AVL tree registration and triggering a VERIFY()
panic in avl_add(). This occurs because the fsconfig/fsmount API lacks
the serialization provided by traditional mount() via lock_mount().

The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that
serializes mount and unmount operations on the same snapshot. The first
mount thread creates a pending entry with se_spa=NULL and holds se_mtx
during the helper execution. Concurrent mounts find the pending entry
and return success without spawning duplicate helpers. Unmount waits on
se_mtx if a mount is pending, ensuring proper serialization. This allows
different snapshots to mount in parallel while preventing the AVL panic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17943
2025-12-08 13:49:11 -08:00