5207 Commits

Author SHA1 Message Date
Rob Norris
d8c08a1cea
Linux 7.0: also set setlease handler on directories (#18331)
It turns out the kernel can also take directory leases, most notably in
the NFS server. Without a setlease handler on the directory file ops,
attempts to open a directory over NFS can fail with EINVAL.

Adding a directory setlease handler was missed in 168023b603. This fixes
that, allowing directories to be properly accessed over NFS.

Sponsored-by: TrueNAS
Reported-by: Satadru Pramanik <satadru@gmail.com>

Signed-off-by: Rob Norris <rob.norris@truenas.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2026-03-17 15:28:30 -07:00
Alexander Motin
3583fa38e8 ZVOL: Restrict cloning with different properties
While technically it's not a problem to clone between ZVOLs with
different properties, it might create an expectation of new properties
being applied during data move, while actually it won't happen.
For copies and checksum it may mean incorrect safety expectations.
For dedup, compression and special_small_blocks -- performance and
space usage.

This is a replica of #18180 from FS.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18315
2026-03-12 18:30:58 -07:00
Alexander Motin
15e37e0919 ZVOL: Add encryption key check for block cloning
Somehow, when block cloning was ported from file systems, the check
for identical encryption keys was missed.  As a result, blocks cloned
between unrelated ZVOLs produced authentication errors on later
reads.  Having the same or different encryption root does not matter.

This patch copies dmu_objset_crypto_key_equal() call from FS side.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18315
2026-03-12 18:30:26 -07:00
Andriy Tkachuk
b403040c4c
draid: fix data corruption after disk clear
Currently, when there are several faulted disks with attached
dRAID spares, and one of those disks is cleared from errors (zpool
clear), followed by its spare being detached, the data in all the
remaining spares that were attached while the cleared disk was in
FAULTED state might get corrupted (which can be seen by running scrub).
In some cases, when too many disks get cleared at a time, this can
result in data corruption/loss.

dRAID spare is a virtual device whose blocks are distributed among
other disks. Those disks can also be in FAULTED state with attached
spares of their own. When a disk gets sequentially resilvered (rebuilt),
the changes made by that resilvering won't get captured in the DTL
(Dirty Time Log) of other FAULTED disks with the attached spares to
which the data is written during the resilvering (as it would normally
be done for the changes made by the user if a new file is written or
some existing one is deleted). It is because sequential resilvering
works on the block level, without touching or looking into metadata,
so it doesn't know anything about the old BPs or transaction groups
that it is resilvering. So later on, when that disk gets cleared
from errors and healing resilvering is trying to sync all the data
from its spare onto it, all the changes made on its spare during the
resilvering of other disks will be missed because they won't be
captured in its DTL. That's why other dRAID spares may get corrupted.

Here's another way to explain it that might be helpful. Imagine a
scenario:

1. d1 fails and gets resilvered to some spare s1 - OK.
2. d2 fails and gets sequentially resilvered on draid spare s2. Now,
   in some slices, s2 would map to d1, which is failed. But d1 has s1
   spare attached, so the data from that resilvering goes to s1, but
   not recorded in d1's DTL.
3. Now, d1 gets cleared and its s1 gets detached. All the changes
   done by the user (writes or deletions) have their txgs captured
   in d1's DTL, so they will be resilvered by the healing resilver
   from its spare (s1) - that part works fine. But the data which
   was written during resilvering of d2 and went to s1 - that one
   will be missed from d1's DTL and won't get resilvered to it. So
   here we are:
4. s2 under d2 is corrupted in the slices which map to d1, because
   d1 doesn't have that data resilvered from s1.

Now, if there are more failed disks with draid spares attached which
were sequentially resilvered while d1 was failed, d3+s3, d4+s4 and
so on - all their spares will be corrupted. Because, in some slices,
each of them will map to d1 which will miss their data.

Solution: add all known txgs starting from TXG_INITIAL to DTLs of
non-writable devices during sequential resilvering so when healing
resilver starts on disk clear, it would be able to check and heal
blocks from all txgs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18286
Closes #18294
2026-03-11 14:54:20 -07:00
Ameer Hamza
5b93d1a218 L2ARC: Fix prev_hdr use-after-free in l2arc_write_sublist
prev_hdr is dereferenced after the sublist lock is dropped for write I/O
but nothing prevents it from being freed during that window. Eliminate
prev_hdr entirely and simplify persistent marker repositioning logic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
2026-03-10 11:00:23 -07:00
Ameer Hamza
b27a87f399 L2ARC: Write budget fairness for metadata monopolization
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run.  Reset
the counter when nothing is written.
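
The fairness rule described above can be sketched in userspace C. This is
only an illustration, not the code from the commit: the struct, the helper
names, and the test for "metadata filled the budget" are assumptions. Only
l2ad_meta_cycles and l2arc_meta_cycles (default 2) come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; only the field name is from the commit. */
typedef struct {
	uint64_t l2ad_meta_cycles;	/* consecutive monopolized cycles */
} l2dev_sketch_t;

/* Default value quoted in the commit message. */
static const uint64_t l2arc_meta_cycles = 2;

/* Before a feed cycle: should the metadata pass be skipped this time? */
static bool
l2arc_skip_meta_sketch(const l2dev_sketch_t *dev)
{
	return (dev->l2ad_meta_cycles >= l2arc_meta_cycles);
}

/* After a feed cycle: update the counter from what was written. */
static void
l2arc_account_cycle_sketch(l2dev_sketch_t *dev, uint64_t meta_written,
    uint64_t data_written, uint64_t budget)
{
	assert(budget > 0);
	if (meta_written + data_written == 0) {
		/* Nothing written: reset, as the commit describes. */
		dev->l2ad_meta_cycles = 0;
	} else if (meta_written >= budget) {
		/* Assumed definition of "metadata fills the budget". */
		dev->l2ad_meta_cycles++;
	} else {
		/* Assumption: any cycle where data got room also resets. */
		dev->l2ad_meta_cycles = 0;
	}
}
```

With these assumptions, two monopolized cycles in a row make the third
cycle skip metadata, and the data-only cycle then clears the counter.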

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
2026-03-10 11:00:14 -07:00
Ameer Hamza
62ca8f721b L2ARC: Scan-based depth cap for persistent markers
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.

Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags. This keeps markers
focused on the tail zone where buffers soon to be evicted have the most
value for L2ARC.
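
A minimal sketch of the depth cap, assuming a single pass and a plain
byte counter. The struct and function names are hypothetical; only
l2arc_ext_headroom_pct (default 25%) and the idea of a lazily consumed
reset flag are taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pass bookkeeping for this sketch. */
typedef struct {
	uint64_t scanned;	/* cumulative bytes scanned in this pass */
	bool reset_pending;	/* lazy request: move the marker to the tail */
} l2pass_sketch_t;

/* Default percentage quoted in the commit message. */
static const uint64_t l2arc_ext_headroom_pct = 25;

/* Account for one scan and request a reset once the cap is reached. */
static void
l2arc_note_scan_sketch(l2pass_sketch_t *p, uint64_t bytes,
    uint64_t state_size)
{
	assert(state_size > 0);
	p->scanned += bytes;
	/* Divide first so very large state sizes cannot overflow. */
	if (p->scanned >= state_size / 100 * l2arc_ext_headroom_pct) {
		p->reset_pending = true;
		p->scanned = 0;
	}
}
```

For a 1000-byte state the cap is 250 bytes, so the third 100-byte scan
is the one that raises the reset request.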

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
2026-03-10 11:00:08 -07:00
Ameer Hamza
15fc3d64c8 L2ARC: Lazy sublist reset flags for persistent markers
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads.  Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.
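
The lazy-flag pattern can be illustrated with C11 atomics. This is a
sketch under assumptions, not the L2ARC code: the fixed sublist count,
the struct, and the function names are all hypothetical, and marker
positions are modelled as plain offsets from the tail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define	NUM_SUBLISTS_SKETCH	4

/* Hypothetical marker state: one flag and one position per sublist. */
typedef struct {
	atomic_bool reset_flag[NUM_SUBLISTS_SKETCH];
	size_t marker_pos[NUM_SUBLISTS_SKETCH];	/* 0 means "at the tail" */
} l2markers_sketch_t;

/* Any thread may ask for a reset; it never touches the markers. */
static void
request_reset_sketch(l2markers_sketch_t *m)
{
	for (int i = 0; i < NUM_SUBLISTS_SKETCH; i++)
		atomic_store(&m->reset_flag[i], true);
}

/* The scanning thread consumes its own flag, then advances its marker. */
static void
scan_sublist_sketch(l2markers_sketch_t *m, int idx, size_t advance)
{
	assert(idx >= 0 && idx < NUM_SUBLISTS_SKETCH);
	if (atomic_exchange(&m->reset_flag[idx], false))
		m->marker_pos[idx] = 0;
	m->marker_pos[idx] += advance;
}
```

The point of the design is ownership: only the scanner of a sublist ever
writes that sublist's marker, so no cross-thread marker locking is needed.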

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
2026-03-10 11:00:01 -07:00
Ameer Hamza
22fdaf0b1f L2ARC: Even sublist headroom distribution with round-robin selection
The dynamic headroom redistribution formula gave later sublists
progressively larger scanning budgets, and random sublist selection
caused uneven coverage across sublists. For depth cap to work
effectively, each sublist should be equally and fairly treated.
Use equal per-sublist headroom (headroom / num_sublists) for even
distribution and deterministic round-robin selection for fair
coverage across cycles.
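
As a rough illustration of the two rules above (equal split and
round-robin), assuming hypothetical helper names and a caller-owned
cursor:

```c
#include <assert.h>
#include <stdint.h>

/* Equal share of the scan headroom for every sublist. */
static uint64_t
sublist_headroom_sketch(uint64_t headroom, unsigned int num_sublists)
{
	assert(num_sublists > 0);
	return (headroom / num_sublists);
}

/* Deterministic round-robin: return the current index, then advance. */
static unsigned int
next_sublist_sketch(unsigned int *cursor, unsigned int num_sublists)
{
	assert(num_sublists > 0);
	unsigned int cur = *cursor % num_sublists;
	*cursor = (cur + 1) % num_sublists;
	return (cur);
}
```

Over any num_sublists consecutive cycles every sublist is visited exactly
once, which random selection cannot guarantee.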

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
2026-03-10 10:59:41 -07:00
Alexander Ziaee
d45c8d6489
FreeBSD: Improve dmesg kernel message prefix
Provide intuitive log search keywords and increased system consistency.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Ziaee <ziaee@FreeBSD.org>
Closes #18290
2026-03-09 10:17:23 -07:00
Juhyung Park
c58b8b7dc2
Fix check for .cfi_negate_ra_state on aarch64
Checking for LD_VERSION is unreliable as not all distros define it on
the compiler's preprocessor.

Explicitly check it via autoconf.

This fixes support for Ubuntu 18.04 on arm64.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Closes #18262
2026-03-06 11:04:37 -08:00
Rob Norris
c329530e6b Add simd_config.h and HAVE_SIMD() selector
We need to select which SIMD variable to check based on the compilation
target: HAVE_KERNEL_xxx for the Linux kernel, HAVE_TOOLCHAIN_xxx for
other platforms.

This adds a HAVE_SIMD() macro that returns the right result depending on the
definedness or value of the variable for this target.

The macro is in simd_config.h, which is forcibly included in every
compiler call (like zfs_config.h), to ensure that it can be used
directly without further includes.
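
One way such a selector could look is shown below. This is a simplified,
value-based sketch, not the contents of simd_config.h: the two #define
lines stand in for configure output, the _KERNEL/__linux__ test is an
assumption about how the target is detected, and the real macro also
handles "definedness", which this version does not.

```c
#include <assert.h>

/* Stand-ins for configure results (hypothetical values). */
#define	HAVE_TOOLCHAIN_AVX2	1
#define	HAVE_KERNEL_AVX2	0

/* Pick the variable family that matches the compilation target. */
#if defined(_KERNEL) && defined(__linux__)
#define	HAVE_SIMD(feature)	(HAVE_KERNEL_##feature)
#else
#define	HAVE_SIMD(feature)	(HAVE_TOOLCHAIN_##feature)
#endif

static int
simd_avx2_sketch(void)
{
#if HAVE_SIMD(AVX2)
	return (1);
#else
	return (0);
#endif
}

static int
simd_avx512f_sketch(void)
{
	/* Undefined identifiers evaluate to 0 inside #if. */
#if HAVE_SIMD(AVX512F)
	return (1);
#else
	return (0);
#endif
}

/* Userspace build: the toolchain variables are the ones consulted. */
static void
simd_selector_check_sketch(void)
{
	assert(simd_avx2_sketch() == 1);
	assert(simd_avx512f_sketch() == 0);
}
```

Call sites then write a single gate, for example `#if HAVE_SIMD(AVX2)`,
and no longer need to know which family of variables applies.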

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
2026-03-05 15:01:42 -08:00
Rob Norris
35f74f84e6 Convert all HAVE_<name> SIMD gates to HAVE_SIMD(<name>)
The original names no longer exist, and the new ones will need to be
selectable based on the current compilation target.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
2026-03-05 15:01:37 -08:00
Rob Norris
23bd583830 config: remove checks for unused SIMD gates
Specifically, we don't have any code gated on:

    HAVE_SSE
    HAVE_SSE3
    HAVE_SSE4_2
    HAVE_AVX512CD
    HAVE_AVX512DQ
    HAVE_AVX512IFMA
    HAVE_AVX512VBMI
    HAVE_AVX512PF
    HAVE_AVX512ER

So we can remove them and the checks that probe and generate them.

Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
2026-03-05 15:01:20 -08:00
Alexander Motin
1e1d64d665
Fix log vdev removal issues
When we clear the log, we should clear all the fields, not only
zh_log.  Otherwise remaining ZIL_REPLAY_NEEDED will prevent the
vdev removal.  Handle it also from the other side, when zh_log
is already cleared, while zh_flags is not.

spa_vdev_remove_log() asserts that allocated space on removed log
device is zero.  While it should be so in a perfect world, it might
not be if space leaked at any point.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18277
2026-03-04 09:12:14 -05:00
Rob Norris
1e2c94a043
More consistent use of TREE_* macros in AVL comparators
Where it is appropriate and obvious, use TREE_CMP(), TREE_ISIGN() and
TREE_PCMP() instead of direct comparisons. It can make the code a lot
smaller, less error prone, and easier to read.
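
To show the style this commit moves toward, here is a small standalone
comparator. The macro bodies are the usual branch-free sign forms and are
reproduced here as an assumption, so the tree's own header remains
authoritative; the key struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed definitions; each yields exactly -1, 0 or 1. */
#define	TREE_ISIGN(a)	(((a) > 0) - ((a) < 0))
#define	TREE_CMP(a, b)	(((a) > (b)) - ((a) < (b)))
#define	TREE_PCMP(a, b)	\
	(((uintptr_t)(a) > (uintptr_t)(b)) - ((uintptr_t)(a) < (uintptr_t)(b)))

/* Hypothetical two-part key for this sketch. */
typedef struct {
	uint64_t objset;
	uint64_t object;
} key_sketch_t;

/* AVL-style comparator: order by objset, then by object. */
static int
key_compare_sketch(const void *a, const void *b)
{
	assert(a != NULL && b != NULL);
	const key_sketch_t *ka = a;
	const key_sketch_t *kb = b;
	int cmp = TREE_CMP(ka->objset, kb->objset);

	if (cmp != 0)
		return (cmp);
	return (TREE_CMP(ka->object, kb->object));
}

/* Normalise strcmp()'s arbitrary-magnitude result to -1, 0 or 1. */
static int
name_compare_sketch(const char *a, const char *b)
{
	int r = strcmp(a, b);

	return (TREE_ISIGN(r));
}
```

Compared with hand-written `if (a < b) return (-1);` ladders, the macro
form avoids the classic bug of subtracting two 64-bit keys and truncating
the result to int.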

Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18259
2026-03-03 09:08:23 -08:00
Brian Behlendorf
0f90a797dd
Fix vdev_rebuild_range() tx commit
The spa_sync thread waits on ->spa_txg_zio and will set ZIO_WAIT_DONE
before running the sync tasks.  The dmu_tx_commit() call must be done
after we add the child zio to the ->spa_txg_zio parent, otherwise it's
possible the child is added after txg_sync has waited.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18276
2026-03-03 09:05:34 -08:00
Ryan Moeller
ac0fd40c8c Add zpool properties for allocation class space
The existing zpool properties accounting pool space (size, allocated,
fragmentation, expandsize, free, capacity) are based on the normal
metaslab class or are cumulative properties of several classes combined.

Add properties reporting the space accounting metrics for each metaslab
class individually.

Also introduce pool-wide AVAIL, USABLE, and USED properties reporting
values corresponding to FREE, SIZE, and ALLOC deflated for raidz.

Update ZTS to recognize the new properties and validate reported values.

While in zpool_get_parsable.cfg, add "fragmentation" to the list of
parsable properties.

Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
2026-03-02 15:50:23 -08:00
Ryan Moeller
6ba3f915d0 zcommon: Fix description of vdev capacity format
Capacity is reported as a percentage not a size.

Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
2026-03-02 15:49:23 -08:00
Akash B
f8e5af53e9
Fix redundant declaration of dsl_pool_t
Remove redundant dsl_pool variable and duplicate spa_get_dsl()
call in vdev_rebuild_thread.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #18263
2026-02-27 10:39:52 -08:00
Andriy Tkachuk
f8457fbdc4
Fix deadlock on dmu_tx_assign() from vdev_rebuild()
vdev_rebuild() is always called with spa_config_lock held in
RW_WRITER mode. However, when it tries to call dmu_tx_assign()
the latter may hang on dmu_tx_wait() waiting for available txg.
But that available txg may not happen because txg_sync takes
spa_config_lock in order to process the current txg. So we have
a deadlock case here:

 - dmu_tx_assign() waits for txg holding spa_config_lock;
 - txg_sync waits for spa_config_lock not progressing with txg.

Here are the stacks:

    __schedule+0x24e/0x590
    schedule+0x69/0x110
    cv_wait_common+0xf8/0x130 [spl]
    __cv_wait+0x15/0x20 [spl]
    dmu_tx_wait+0x8e/0x1e0 [zfs]
    dmu_tx_assign+0x49/0x80 [zfs]
    vdev_rebuild_initiate+0x39/0xc0 [zfs]
    vdev_rebuild+0x84/0x90 [zfs]
    spa_vdev_attach+0x305/0x680 [zfs]
    zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]

    cv_wait_common+0xf8/0x130 [spl]
    __cv_wait+0x15/0x20 [spl]
    spa_config_enter+0xf9/0x120 [zfs]
    spa_sync+0x6d/0x5b0 [zfs]
    txg_sync_thread+0x266/0x2f0 [zfs]

The solution is to pass txg returned by spa_vdev_enter(spa)
at the top of spa_vdev_attach() to vdev_rebuild() and call
dmu_tx_create_assigned(txg) which doesn't wait for txg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18210
Closes #18258
2026-02-26 11:18:02 -08:00
Rob Norris
f3d4c79496
zpl_super: prefer "new" mount API when available
This API has been available since kernel 5.2, and having it available
(almost) everywhere should give us a lot more flexibility for mount
management in the future.

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18260
2026-02-25 13:17:33 -08:00
Rob Norris
09c27a14a3 icp: add SHA512 implementation using Intel SHA512 extensions
Generated from crypto/sha/asm/sha512-x86_64.pl in
openssl/openssl@241d4826f8.

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18233
2026-02-25 12:48:30 -08:00
Rob Norris
3547a358fd simd: detect and surface support for Intel SHA512 extensions
Recent Intel CPUs (starting with Arrow Lake and Lunar Lake) include new
vectorised SHA512 instructions. Detect them and make them available to
the rest of the system.

Note the internal name "sha512ext". This is to disambiguate from other
uses of "sha512".

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18233
2026-02-25 12:47:48 -08:00
clefru
6495dafd58
range_tree: use zfs_panic_recover() for partial-overlap remove
zfs_range_tree_remove_impl() used a bare panic() when a segment to be
removed was not completely overlapped by an existing tree entry.  Every
other consistency check in range_tree.c uses zfs_panic_recover(), which
respects the zfs_recover tunable and allows pools with on-disk
corruption to be imported and recovered.  This one call was
inconsistent, making the partial-overlap case unrecoverable regardless
of zfs_recover.

Replace panic() with zfs_panic_recover() so that operators can set
zfs_recover=1 to import a corrupted pool and reclaim data, consistent
with all other range tree error paths.
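
The behavioural difference can be modelled in userspace. The sketch below
is an assumption-laden stand-in, not the range tree code: the variable
zfs_recover_sketch plays the role of the zfs_recover tunable, and both
function names and the return convention are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the zfs_recover tunable. */
static int zfs_recover_sketch = 0;

/* Warn and continue when recovery is enabled, otherwise abort. */
static void
panic_recover_sketch(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	if (!zfs_recover_sketch)
		abort();
}

/* Returns 0 on success, -1 if the removal is only partially covered. */
static int
remove_segment_sketch(uint64_t seg_start, uint64_t seg_end,
    uint64_t rm_start, uint64_t rm_end)
{
	assert(rm_start < rm_end);
	if (rm_start < seg_start || rm_end > seg_end) {
		panic_recover_sketch("removing segment [%llu, %llu) not "
		    "completely overlapped by [%llu, %llu)",
		    (unsigned long long)rm_start, (unsigned long long)rm_end,
		    (unsigned long long)seg_start, (unsigned long long)seg_end);
		return (-1);
	}
	return (0);
}
```

With the stand-in tunable left at 0 the partial-overlap case still aborts,
which matches the old bare panic(); setting it to 1 turns the same case
into a logged warning the caller can survive.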

Related-to: https://github.com/openzfs/zfs/issues/13483
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #18255
2026-02-25 11:26:10 -08:00
Alexander Motin
991fc56fae
Introduce dedupused/dedupsaved pool properties
Currently there is only a dedup ratio reported via pool properties.
If dedup is enabled only for some datasets, it is impossible to say
how much space the ratio actually covers.  Fix this by introducing
dedupused/dedupsaved pool properties, similar to earlier added
block cloning ones.  Combined with work to expose allocation classes
stats, it should give user-space enough visibility to correlate
`zpool list` and `zfs list` space numbers.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18245
2026-02-25 09:41:38 -05:00
Rob Norris
0f608aa6ca Linux 7.0: add shims for the fs_context-based mount API
The traditional mount API has been removed, so detect when it's not
available and instead use a small adapter to allow our existing mount
functions to keep working.

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18216
2026-02-23 09:45:12 -08:00
Rob Norris
204de946eb Linux 7.0: blk_queue_nonrot() renamed to blk_queue_rot()
It does exactly the same thing, just inverts the return. Detect its
presence or absence and call the right one.
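
A compat shim of this kind usually reduces to one wrapper with an #ifdef
inside. The sketch below uses userspace stand-ins so it can be compiled on
its own: the struct, the stub helpers, the wrapper name, and the
HAVE_BLK_QUEUE_ROT configure symbol are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's request queue. */
struct request_queue_sketch {
	bool rotational;
};

/* Pretend configure found the new helper (hypothetical symbol). */
#define	HAVE_BLK_QUEUE_ROT	1

/* Stubs modelling whichever helper the kernel provides. */
#ifdef HAVE_BLK_QUEUE_ROT
static bool
blk_queue_rot(const struct request_queue_sketch *q)
{
	return (q->rotational);
}
#else
static bool
blk_queue_nonrot(const struct request_queue_sketch *q)
{
	return (!q->rotational);
}
#endif

/* Single call site for the rest of the code: true for non-rotational. */
static bool
queue_is_nonrot_sketch(const struct request_queue_sketch *q)
{
	assert(q != NULL);
#ifdef HAVE_BLK_QUEUE_ROT
	return (!blk_queue_rot(q));	/* new name, inverted sense */
#else
	return (blk_queue_nonrot(q));	/* old name */
#endif
}
```

Keeping the inversion inside the wrapper means callers keep their existing
"non-rotational" semantics regardless of which kernel is being built
against.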

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18216
2026-02-23 09:44:20 -08:00
MigeljanImeri
4975430cf5
Add vdev property to disable vdev scheduler
Added vdev property to disable the vdev scheduler.
The intention behind this property is to improve IOPS
performance when using O_DIRECT.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes #17358
2026-02-23 09:34:33 -08:00
Tony Hutter
d2f5cb3a50
Move range_tree, btree, highbit64 to common code
Break out the range_tree, btree, and highbit64/lowbit64 code from kernel
space into shared kernel and userspace code.  This is needed for the
updated `zpool status -vv` error byte range reporting that will be
coming in a future commit.  That commit needs the range_tree code in
kernel and userspace.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18133
2026-02-22 11:43:51 -08:00
Rob Norris
168023b603
Linux 7.0: explicitly set setlease handler to kernel implementation
The upcoming 7.0 kernel will no longer fall back to generic_setlease(),
instead returning EINVAL if .setlease is NULL. So, we set it explicitly.

To ensure that we catch any future kernel change, add a sanity test for
F_SETLEASE and F_GETLEASE too. Since this is a Linux-specific test,
also make a small adjustment to the test runner to allow OS-specific
helper programs.

Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18215
2026-02-22 11:39:06 -08:00
Alexander Motin
d06a1d9ac3
Fix available space accounting for special/dedup (#18222)
Currently, spa_dspace (base to calculate dataset AVAIL) only includes
the normal allocation class capacity, but dd_used_bytes tracks space
allocated across all classes.  Since we don't want to report free
space of other classes as available (we can't promise new allocations
will be able to use it), report only allocated space, similar to how
we report space saved by dedup and block cloning.

Since we need deflated space here, make allocation classes track
deflated allocated space also.  While here, make mc_deferred also
deflated, matching its use contexts.  Also while there, use
atomic_load() to read the allocation class stats.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18190
Closes #18222
2026-02-19 10:36:35 -08:00
Tony Hutter
640a217faf
CI: Test & fix Linux ZFS built-in build
ZFS can be built directly into the Linux kernel.  Add a test build
of this to the CI to verify it works.  The test build is only enabled
on Fedora runners (since they run the newest kernels) and is done in
parallel with ZTS.  The test build is done on vm2, since it typically
finishes ~15min before vm1 and thus has time to spare.

In addition:

- Update 'copy-builtin' to check that $1 is a directory
- Fix some VERIFYs that were causing the built-in build to fail

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18234
2026-02-19 10:15:41 -08:00
Attila Fülöp
c8a72a27e5
ICP: AES-GCM assembly: remove unused Gmul functions
In the AES-GCM assembly files we are defining Gmul functions we
don't use anywhere.

Just remove the dead code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #18226
2026-02-19 10:10:02 -08:00
Alexander Motin
370570890f
Remove parent ZIO from dbuf_prefetch()
I am not sure why it was added there 10 years ago, but it seems not
needed now.  According to my tests removing it improves sequential
read performance with recordsize=4K by 5-10% by reducing the CPU
overhead in prefetcher.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18214
2026-02-18 18:12:13 -08:00
Attila Fülöp
d489677280
ICP: AES-GCM VAES-AVX2: fix typos and document source files
Require AVX2 compiler support and document source files for
`aesni-gcm-avx2-vaes.S`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #18225
2026-02-17 16:51:32 -08:00
Attila Fülöp
bee53d8c10
Linux 6.19 compat: in-tree build: fix duplicate GCM assembly functions
Linux 6.19 added an AES-GCM VAES-AVX2 assembly implementation. It's
basically a translation from the BoringSSL perlasm syntax to macro
assembly. We're using the same source, but as perlasm-generated flat
assembly, which shares some global function names with the former.
When building in-tree this results in the linker failing due to the
duplicate symbols.

To avoid the error we prepend `icp_` via a macro to our function
names.
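
The renaming trick is ordinary preprocessor token pasting. The real change
applies it in the assembly sources; the C sketch below only demonstrates
the mechanism, and both the macro name and the example function are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefixing macro: FOO becomes icp_FOO. */
#define	ICP_SYM_SKETCH(name)	icp_##name

/*
 * Expands to: static size_t icp_gcm_block_count(size_t len)
 * so it cannot collide with an in-kernel gcm_block_count symbol.
 */
static size_t
ICP_SYM_SKETCH(gcm_block_count)(size_t len)
{
	assert(len <= SIZE_MAX - 15);
	return ((len + 15) / 16);	/* 16-byte blocks, rounded up */
}
```

Call sites can use either the macro form or the expanded name; both refer
to the same prefixed symbol.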

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Moch <mail@alexmoch.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #18204
Closes #18224
2026-02-17 13:09:41 -08:00
Alexander Motin
0f9564e85b
Simplify dnode_level_is_l2cacheable()
We should not dereference through dn_handle->dnh_dnode once we
already have a dnode pointer.  The result will be the same.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18212
2026-02-16 10:34:22 -05:00
Alexander Motin
ba970eb202
Cleanup allocation class selection
 - For multilevel gang blocks it seemed possible to fall back from
normal to special class, since they don't have proper object type,
and DMU_OT_NONE is a "metadata".  They should never fall back.
 - Fix possible inversion with zfs_user_indirect_is_special = 0,
when indirects are written to normal vdev, while small data to special.
Make small indirect blocks also follow special_small_blocks there.
 - With special_small_blocks now applying to both files and ZVOLs,
make it apply to all non-metadata without extra checks, since there
are no other non-metadata types.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18208
2026-02-16 10:33:21 -05:00
Mariusz Zaborski
cdf89f413c
Flush RRD only when TXGs contain data
This change modifies the behavior of spa_sync_time_logger when
flushing the RRD database.

Previously, once the sync interval elapsed, a flush would always
be generated. On solid-state devices, especially when the pool was
otherwise idle, this caused disks to wake up solely to write RRD
data. Since RRD is best-effort telemetry, this behavior is
unnecessary and wasteful.

With this change, spa_sync_time_logger delays flushing until a TXG
that already contains data is being synced. The RRD update is
appended to that TXG instead of forcing the creation of
a new write-only TXG.

During pool export, flushing is forced regardless of whether
the TXG contains user data. At that stage, data durability takes
precedence and a write must be issued.
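
The decision described above fits in a few lines. This is a sketch of the
policy only, with a hypothetical function name and parameters; it does not
reproduce spa_sync_time_logger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Should the RRD be flushed as part of the TXG being synced now? */
static bool
rrd_should_flush_sketch(uint64_t now, uint64_t last_flush,
    uint64_t interval, bool txg_has_data, bool exporting)
{
	assert(now >= last_flush);
	if (exporting)
		return (true);		/* export always forces a flush */
	if (now - last_flush < interval)
		return (false);		/* interval has not elapsed yet */
	return (txg_has_data);		/* piggyback only on a dirty TXG */
}
```

The previous behaviour corresponds to returning true as soon as the
interval elapses, which is what woke otherwise idle pools.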

Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #18082
Closes #18138
2026-02-11 11:35:45 -08:00
Marc Sladek
cc184fe98b
Fix send:raw permission for send -w -I
When performing an incremental raw send with intermediates (-w -I),
the standard 'send' permission was incorrectly required instead of
allowing 'send:raw'. This was due to a strict boolean comparison on
the 'rawok' flag in zfs_secpolicy_send() with a non-boolean value.

This change normalizes the 'rawok' variable to be strictly 0/1 and
updates the test suite to properly verify delegated raw send behavior.
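
This class of bug is easy to reproduce in isolation. The sketch below is a
hypothetical reconstruction, not the code in zfs_secpolicy_send(): the
flag bit and both function names are invented to show why comparing an
unnormalised flag against 1 fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; any bit other than bit 0 shows the problem. */
#define	SEND_FLAG_RAW_SKETCH	(1U << 3)

/* Buggy form: rawok is 0 or 8, so "== 1" is never true. */
static bool
raw_send_ok_buggy_sketch(uint32_t flags)
{
	int rawok = flags & SEND_FLAG_RAW_SKETCH;

	return (rawok == 1);
}

/* Fixed form: "!!" normalises the value to strictly 0 or 1 first. */
static bool
raw_send_ok_fixed_sketch(uint32_t flags)
{
	int rawok = !!(flags & SEND_FLAG_RAW_SKETCH);

	return (rawok == 1);
}

/* Side-by-side check of the two variants on the same input. */
static void
rawok_contrast_sketch(void)
{
	assert(!raw_send_ok_buggy_sketch(SEND_FLAG_RAW_SKETCH));
	assert(raw_send_ok_fixed_sketch(SEND_FLAG_RAW_SKETCH));
}
```

In the buggy variant a raw send is treated as non-raw, which is how the
stricter 'send' permission could end up being demanded.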

Introduced-by: https://github.com/openzfs/zfs/pull/17543
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Marc Sladek <marc@sladek.dev>
Closes #18198
Closes #18193
2026-02-11 10:30:26 -08:00
Alexander Motin
aa29455dd7
Restrict cloning with different properties
While technically it's not a problem to clone between datasets with
different properties, it might create an expectation of new properties
being applied during data move, while actually it won't happen.
For copies and checksum it may mean incorrect safety expectations.
For dedup, compression and special_small_blocks -- performance and
space usage. New zfs_bclone_strict_properties tunable controls it.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18180
2026-02-10 09:53:24 -08:00
rmacklem
1412bdc6c2
zfs_vnops_os.c: Move a vput() to after zfs_setattr_dir()
Without this patch, the following crash can occur when
a file system is configured with "xattr=dir".

VNASSERT failed: locked not true at
 /posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked)
    hold count flags ()
    flags ()
    lock type zfs: UNLOCKED
panic: zfs_dirent_lookup: vnode is not locked but should be
cpuid = 3
time = 1770520763
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b
vpanic() at vpanic+0x136/frame 0xfffffe00914c8270
panic() at panic+0x43/frame 0xfffffe00914c82d0
assert_vop_locked() at assert_vop_locked+0x78
zfs_dirent_lookup() at zfs_dirent_lookup+0x41
zfs_setattr_dir() at zfs_setattr_dir+0x123
zfs_setattr() at zfs_setattr+0x1389
zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b
VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d
setfown() at setfown+0xb1
kern_fchownat() at kern_fchownat+0x192

This patch fixes the problem by moving the vput() call for
attrzp to after the zfs_setattr_dir() call that takes it as
an argument.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes: #18188
2026-02-10 09:29:37 -05:00
Alexander Motin
2646bd5585
Allow rewrite skip cloned and snapshotted blocks
Rewrite of cloned and snapshotted blocks can allocate additional
space, which may be undesired.  In some cases it may make sense
to still rewrite snapshotted blocks, expecting the snapshots to
rotate with time, freeing space.  In other cases rewrite of cloned
blocks may be acceptable, despite persistent space usage increase.
For this reason add them as separate flags to `zfs rewrite`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18179
2026-02-09 10:17:56 -08:00
Brian Behlendorf
ae488e496f ZTS: update the relevant mmp test cases
- mmp_concurrent_import: added test case to verify concurrent
  import correctness.  The pool may only be imported once.

- mmp_exported_import: an activity check is now required for pools
  which were cleanly exported if the system and pool hostids don't
  match.

- mmp_inactive_import: an activity check is now required for any
  pool which wasn't cleanly exported, even if the system and pool
  hostids match.

- mmp_on_uberblocks: updated expected uberblocks to take into account
  the value MMP_INTERVAL_DEFAULT is set to.

- mmp_reset_interval: reduce the number of iterations from 10 to 3.
  This is sufficient to verify functionality and significantly speeds
  up the test.

- mmp_on_uberblocks: adjust the thresholds and increase the runtime
  to avoid false positives observed in CI.

- Update tests to use 'zhack action idle' instead of ztest to improve
  the reliability of the tests.

- Add additional log_note messages to test cases which have multiple
  verification steps to make it clear which portion of a test failed
  when reviewing the logs.

- Replace default_setup/cleanup_noexit calls with 'zpool create' and
  'zpool destroy' calls to avoid additional unnecessary dataset
  creation work.

- Update activity/noactivity check helper functions to use the
  ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
  to determine if this activity check ran and why.  This is more
  reliable in the CI than measuring the runtime.

- Removed all mmp tests from the zts-report.py exceptions list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:36:18 -08:00
Brian Behlendorf
20176224ee mmp: claim sequence id before final import
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts.  This check is
only required when the timing is such that there's no activity
for the read-only tryimport check to detect.  This extra
safety check operates as follows:

1. Repeats the following MMP check 10 times:
  a. Write out an MMP uberblock with the best txg and a random
     sequence id to all primary pool vdevs.
  b. Verify a minimum number of good writes such that even if
     the pool appears degraded on the remote host it will see
     at least one of the updated MMP uberblocks.
  c. Wait for the MMP interval.  This leaves a window for other
     racing hosts to make similar modifications which can be
     detected.
  d. Call vdev_uberblock_load() to determine the best uberblock
     to use.  This should be the MMP uberblock just written.
  e. Verify the txg and random sequence number match the MMP
     uberblock written in 1a.

2. Restore the original MMP uberblocks.  This allows the check
   to be performed again if the pool fails to import for an
   unrelated reason.
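
The steps above can be sketched as a small simulation.  This is a
toy model under stated assumptions, not the OpenZFS implementation:
`FakePool`, `claim_pool`, `wait` and `min_good_writes` are hypothetical
names, each vdev is modeled as a single `(txg, seq)` slot, the "best"
uberblock is simplified to the highest `(txg, seq)` pair, the 16-bit
sequence range is assumed, and the originals are restored only when
every round passes.

```python
import random


class FakePool:
    """Toy pool: each vdev holds one (txg, seq) MMP uberblock slot."""

    def __init__(self, nvdevs, txg):
        self.slots = [(txg, 0)] * nvdevs

    def write_all(self, txg, seq):
        # Model writes that always succeed; return the number of good writes.
        self.slots = [(txg, seq)] * len(self.slots)
        return len(self.slots)

    def best(self):
        # Simplification: "best" is the highest (txg, seq) pair.
        return max(self.slots)


def claim_pool(pool, txg, wait, rounds=10, min_good_writes=1, rng=random):
    """Return True if no competing writer was seen in any round."""
    original = list(pool.slots)
    for _ in range(rounds):
        seq = rng.randrange(1, 1 << 16)        # 1a: random sequence id
        if pool.write_all(txg, seq) < min_good_writes:
            return False                       # 1b: too few good writes
        wait(pool)                             # 1c: window for racing hosts
        if pool.best() != (txg, seq):          # 1d, 1e: reload and compare
            return False
    pool.slots = original                      # 2: restore the originals
    return True
```

Passing a `wait` callback that does nothing models an uncontended
import, while a callback that calls `pool.write_all()` with a different
sequence id models a racing host and makes `claim_pool()` return False.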

This change also includes some refactoring and minor improvements.

- Never try loading earlier txgs during import when the import
  fails with EREMOTEIO or EINTR.  These errors don't indicate
  the txg is damaged but instead that it's either in use on a
  remote host or the import was interactively cancelled.  A
  rewind is also not performed for EBADD, which can result from a
  stale trusted config when doing a verbatim import.

- Refactor the code for consistent logging of the multihost
  activity check using spa_load_note() and console messages
  indicating when the activity check was triggered and the result.

- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
  modification of the sequence number in an uberblock.

- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
  set to dump the spa_load_info nvlist returned during import
  to stdout.  This is used by the updated mmp test cases
  to determine if an activity check was run and its result.

- Standardize the mmp messages similarly to make it easier to
  find all the relevant mmp lines in the debug log.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:36:01 -08:00
Brian Behlendorf
2f048ced4d mmp: add spa_load_name() for tryimport
Tryimport adds a unique prefix to the pool name to avoid name
collisions.  This makes it awkward to log user-friendly info
during a tryimport.  Add a spa_load_name() function which can
be used to report the unmodified pool name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:35:03 -08:00
Brian Behlendorf
62a1bf7d19 mmp: move "Starting import" log message
Move the "Starting import" log message into the import block so
it's matched with the "Finished importing" debug message.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:34:57 -08:00
Brian Behlendorf
a9564b1787 mmp: further restrict mmp exported pool check
For cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check.  Only allow the check to be skipped when the
last imported hostid matches the system's hostid and the pool was
cleanly exported.
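
The resulting rule can be written as a single predicate.  This is a
minimal sketch that assumes multihost is enabled and ignores any other
conditions the real import path may consider; the function and
parameter names are hypothetical.

```python
def activity_check_required(pool_hostid, system_hostid, cleanly_exported):
    """Return True when the import must run the MMP activity check."""
    # The check may be skipped only when BOTH conditions hold.
    can_skip = cleanly_exported and pool_hostid == system_hostid
    return not can_skip
```

Under this rule a cleanly exported pool last imported by another host
still requires the check, as does any pool that was not cleanly
exported, even when the hostids match.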

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:32:58 -08:00
Austin Wise
4f180e095a
Fix activating large_microzap on receive
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes #18143
Closes #18144
2026-02-05 15:48:03 -08:00