Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
The two additional fields are never used by calling code, and we can
replace their sole internal use with an extra stack param.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
It was only used when the mount cache was disabled; since the cache is
now always enabled, we don't need it.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
FreeBSD's getextmntent.c is only separate because it has a different
license to mnttab.c, otherwise it would go there too.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
We can't change the public interface, but internally we don't need so
much redundant naming.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
There's no real reason not to always enable it; the `zfs` command always
enables it anyway, and right now there are multiple places that do mount
work without going through the cache. Having it always on lets us remove
a bunch of the fallback code.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
More consistent, less typing, and we can check ownership.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18296
prev_hdr is dereferenced after the sublist lock is dropped for write I/O
but nothing prevents it from being freed during that window. Eliminate
prev_hdr entirely and simplify persistent marker repositioning logic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
Under heavy metadata load, metadata passes can monopolize the write
budget every cycle while data passes get nothing written. Track
consecutive monopolized cycles per device in l2ad_meta_cycles. After
l2arc_meta_cycles (default 2) consecutive cycles where metadata fills
the write budget, skip metadata for one cycle to let data run. Reset
the counter when nothing is written.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
With persistent markers and inclusive scanning, the marker traverses the
entire ARC state across many feed cycles, writing buffers far from the
tail that may no longer be relevant.
Track cumulative bytes scanned per pass in l2arc_ext_scanned. When scans
reach l2arc_ext_headroom_pct (default 25%) of the ARC state size, reset
the pass markers to the tail via lazy reset flags. This keeps markers
focused on the tail zone where buffers soon to be evicted have the most
value for L2ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
Replace direct marker-to-tail manipulation with per-sublist boolean
flags consumed lazily by feed threads. Each scanning thread resets its
own marker when it sees the flag, rather than having another thread
manipulate the marker directly.
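The lazy-flag pattern might look something like the sketch below. It is illustrative only: the names are hypothetical, and using C11 atomics for the flag is an assumption of this sketch (the real code may well rely on the sublist lock instead).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-sublist state. */
typedef struct sketch_sublist {
	atomic_bool reset_flag;	/* set by whoever wants a reset */
	size_t marker_pos;	/* touched only by the feed thread; 0 == tail */
} sketch_sublist_t;

static void
sketch_init(sketch_sublist_t *sl, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		atomic_init(&sl[i].reset_flag, false);
		sl[i].marker_pos = 0;
	}
}

/* Requester side: only raise the flags, never touch the markers. */
static void
sketch_request_reset(sketch_sublist_t *sl, size_t n)
{
	assert(sl != NULL);
	for (size_t i = 0; i < n; i++)
		atomic_store(&sl[i].reset_flag, true);
}

/* Feed thread side: consume the flag and move its own marker to the tail. */
static bool
sketch_consume_reset(sketch_sublist_t *sl)
{
	assert(sl != NULL);
	if (!atomic_exchange(&sl->reset_flag, false))
		return (false);
	sl->marker_pos = 0;
	return (true);
}
```

The point of the split is ownership: the marker is only ever written by the thread that scans with it.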
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
The dynamic headroom redistribution formula gave later sublists
progressively larger scanning budgets, and random sublist selection
caused uneven coverage across sublists. For the depth cap to work
effectively, each sublist should be treated equally and fairly.
Use equal per-sublist headroom (headroom / num_sublists) for even
distribution and deterministic round-robin selection for fair
coverage across cycles.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18289
The autoconf checks are more than enough to decide whether we can work
with this kernel.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18295
Teach `zfs {create,clone,rename}` to accept a doubled `-p` flag (`-pp`)
to create non-existing ancestor datasets with `canmount=off`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes #17000
This will be used to support creating non-mountable ancestors in zfs(8).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes #17000
When a namespace property is changed via zfs set, libzfs remounts the
filesystem to propagate the new VFS mount flags. The current approach
uses mount(2) with MS_REMOUNT, which reads all namespace properties
from ZFS and applies them together. This has two problems:
1. Linux VFS resets unspecified per-mount flags on remount. If an
administrator sets a temporary flag (e.g. mount -o remount,noatime),
a subsequent zfs set on any namespace property clobbers it.
2. Two concurrent zfs set operations on different namespace properties
can overwrite each other's mount flags.
Additionally, legacy datasets (mountpoint=legacy) were never remounted
on namespace property changes since zfs_is_mountable() returns false
for them.
Add zfs_mount_setattr() which uses mount_setattr(2) to selectively
update only the mount flags that correspond to the changed property.
For legacy datasets, /proc/mounts is iterated to update all
mountpoints. On kernels without mount_setattr (ENOSYS), non-legacy
datasets fall back to a full remount; legacy mounts are skipped to
avoid clobbering temporary flags.
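To illustrate why the selective update avoids both problems, here is a sketch of how a single changed property could be turned into the attr_set/attr_clr masks that mount_setattr(2) takes. It is not the zfs_mount_setattr() implementation: the struct and constants are local copies that mirror `struct mount_attr` and the `MOUNT_ATTR_*` values from linux/mount.h, the function names are hypothetical, and mapping atime=on to the kernel default (relatime) is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct mount_attr from linux/mount.h. */
typedef struct sketch_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
} sketch_mount_attr_t;

/* Local mirrors of the MOUNT_ATTR_* values. */
#define	SKETCH_ATTR_RDONLY	0x01ULL
#define	SKETCH_ATTR_NOSUID	0x02ULL
#define	SKETCH_ATTR_NODEV	0x04ULL
#define	SKETCH_ATTR_NOEXEC	0x08ULL
#define	SKETCH_ATTR_ATIME_MASK	0x70ULL	/* MOUNT_ATTR__ATIME */
#define	SKETCH_ATTR_NOATIME	0x10ULL

static void
sketch_flag(sketch_mount_attr_t *attr, uint64_t flag, bool enable)
{
	if (enable)
		attr->attr_set |= flag;
	else
		attr->attr_clr |= flag;
}

/*
 * Build masks touching only the flag that belongs to the changed
 * property. Returns 0 on success, -1 for a property this sketch does
 * not know about.
 */
static int
sketch_prop_to_attr(const char *prop, bool on, sketch_mount_attr_t *attr)
{
	assert(prop != NULL && attr != NULL);
	memset(attr, 0, sizeof (*attr));

	if (strcmp(prop, "readonly") == 0) {
		sketch_flag(attr, SKETCH_ATTR_RDONLY, on);
	} else if (strcmp(prop, "exec") == 0) {
		sketch_flag(attr, SKETCH_ATTR_NOEXEC, !on);
	} else if (strcmp(prop, "setuid") == 0) {
		sketch_flag(attr, SKETCH_ATTR_NOSUID, !on);
	} else if (strcmp(prop, "devices") == 0) {
		sketch_flag(attr, SKETCH_ATTR_NODEV, !on);
	} else if (strcmp(prop, "atime") == 0) {
		/* The atime mode is a field: clear it all, then set one mode. */
		attr->attr_clr = SKETCH_ATTR_ATIME_MASK;
		if (!on)
			attr->attr_set = SKETCH_ATTR_NOATIME;
	} else {
		return (-1);
	}
	return (0);
}
```

Because every flag outside the two masks is left alone, a temporary noatime set by the administrator survives a later `zfs set readonly=on`, and two concurrent updates to different properties touch disjoint bits.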
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18257
Provide intuitive log search keywords and increased system consistency.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Ziaee <ziaee@FreeBSD.org>
Closes #18290
When creating a pool with devices that have incompatible block sizes,
the kernel returns EDOM. However, zpool_create() did not handle this
errno, falling through to zpool_standard_error() which produced a
confusing message about invalid property values.
Add a case EDOM handler in zpool_create() to return EZFS_BADDEV with
a descriptive auxiliary message, consistent with the existing EDOM
handler in zpool_vdev_add().
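The shape of the fix is an extra case in the errno switch. The sketch below uses a hypothetical enum and helper in place of the libzfs internals, and the auxiliary message wording is invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the libzfs error codes. */
typedef enum sketch_err {
	SKETCH_EZFS_UNKNOWN,	/* falls through to the generic handler */
	SKETCH_EZFS_BADDEV
} sketch_err_t;

/*
 * Map the errno from pool creation to an error code plus an optional
 * auxiliary message. EDOM gets its own case instead of falling through.
 */
static sketch_err_t
sketch_create_error(int err, const char **aux)
{
	assert(aux != NULL);
	switch (err) {
	case EDOM:
		/* Message text is illustrative only. */
		*aux = "devices have incompatible block sizes";
		return (SKETCH_EZFS_BADDEV);
	default:
		*aux = NULL;
		return (SKETCH_EZFS_UNKNOWN);
	}
}
```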
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes #18268
This commit adds an implementation of lzc_send_progress, which
existed in the libzfs_core header, but was not in the ABI and lacked
an actual implementation. The libzfs_send_progress function
is altered so that it wraps around the lzc operation. This
fills a functional gap in libzfs_core.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Andrew Walker <andrew.walker@truenas.com>
Closes #18288
Checking for LD_VERSION is unreliable, as not all distros define it in
the compiler's preprocessor.
Explicitly check it via autoconf.
This fixes support for Ubuntu 18.04 on arm64.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Closes #18262
So it's easier to remove and replace on non-Unix platforms.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18281
zstream currently contains three identical copies of dump_record(),
which appear to all be drawn from libzfs_sendrecv.c. The original
is marked internal.
This PR adds zstream_util.[hc] and puts the shared code there along with
a couple of other items in common.
No functional changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Garth Snyder <garth@garthsnyder.com>
Closes #18284
* Add an option to send datasets with params or replicate
without preserving encryption
* Add a test case for the new functionality
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Chris Jacobs <idefix2020dev@gmail.com>
Closes #18240
We need to select which SIMD variable to check based on the compilation
target: HAVE_KERNEL_xxx for the Linux kernel, HAVE_TOOLCHAIN_xxx for
other platforms.
This adds a HAVE_SIMD() macro that returns the right result depending on
the definedness or value of the variable for this target.
The macro is in simd_config.h, which is forcibly included in every
compiler call (like zfs_config.h), to ensure that it can be used
directly without further includes.
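One way such a macro can be built is the "is this macro defined to 1" trick familiar from the Linux kernel's IS_ENABLED(). The sketch below is an assumption about the general technique, not a copy of simd_config.h: every `SK_*` and `*_SKETCH` name is hypothetical, and the AVX2 define stands in for configure output.

```c
#include <assert.h>

/* Expands to 1 if x is a macro defined as 1, and to 0 otherwise. */
#define	SK_PLACEHOLDER_1		0,
#define	SK_SECOND(ignored, val, ...)	val
#define	SK_IS_DEFINED3(arg1_or_junk)	SK_SECOND(arg1_or_junk 1, 0, 0)
#define	SK_IS_DEFINED2(val)		SK_IS_DEFINED3(SK_PLACEHOLDER_##val)
#define	SK_IS_DEFINED(x)		SK_IS_DEFINED2(x)

/* Pick the variable family that matches the compilation target. */
#if defined(__linux__) && defined(_KERNEL)
#define	HAVE_SIMD_SKETCH(x)	SK_IS_DEFINED(HAVE_KERNEL_##x)
#else
#define	HAVE_SIMD_SKETCH(x)	SK_IS_DEFINED(HAVE_TOOLCHAIN_##x)
#endif

/* Stand-in for configure output; SHA512EXT is left undefined on purpose. */
#define	HAVE_TOOLCHAIN_AVX2	1

static_assert(HAVE_SIMD_SKETCH(AVX2) == 1, "AVX2 should be reported");
static_assert(HAVE_SIMD_SKETCH(SHA512EXT) == 0, "SHA512EXT should not be");

/* The result is a plain 0 or 1, so it works in #if as well as in C code. */
static int
sketch_have_avx2(void)
{
#if HAVE_SIMD_SKETCH(AVX2)
	return (1);
#else
	return (0);
#endif
}
```

Callers then write one test per feature and never mention the kernel or toolchain variable directly.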
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
The original names no longer exist, and the new ones will need to be
selectable based on the current compilation target.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
The kernel may be built with a different compiler, and also includes
objtool, which may fail on unknown instruction sequences. So, we want
to run the checks a second time for that toolchain too.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
No need to repeat all that boilerplate each time!
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
Specifically, we don't have any code gated on:
HAVE_SSE
HAVE_SSE3
HAVE_SSE4_2
HAVE_AVX512CD
HAVE_AVX512DQ
HAVE_AVX512IFMA
HAVE_AVX512VBMI
HAVE_AVX512PF
HAVE_AVX512ER
So we can remove them and the checks that probe and generate them.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
Most of the X86_FEATURE_* defines we use were introduced in kernels much
older than those we support, so there's no need to check for them.
For the history, these are the ones being removed, and the kernel
versions/commits where they were introduced:
<4.6 torvalds/linux@cd4d09ec6f (refactor/consolidation commit)
OSXSAVE
BMI1
BMI2
AES
PCLMULQDQ
MOVBE
SHA_NI
AVX512F
AVX512CD
AVX512ER
AVX512PF
4.6 torvalds/linux@d050049442
AVX512BW
AVX512DQ
AVX512VL
4.10 torvalds/linux@a8d9df5a50
AVX512IFMA
AVX512VBMI
4.15 torvalds/linux@c128dbfa0f
VAES
VPCLMULQDQ
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18285
When we clear the log, we should clear all the fields, not only
zh_log. Otherwise a remaining ZIL_REPLAY_NEEDED will prevent the
vdev removal. Handle it from the other side too, when zh_log
is already cleared but zh_flags is not.
spa_vdev_remove_log() asserts that the allocated space on the removed
log device is zero. In a perfect world it would be, but it might
not be if space leaked at any point.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18277
Slightly decrease the number of uberblock block writes required, due
to variation observed when running in the CI. This should help
avoid future false positives.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18280
The following tests have been observed to occasionally fail when
running under the CI. Update our exceptions list to track them.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18274
Where it is appropriate and obvious, use TREE_CMP(), TREE_ISIGN() and
TREE_PCMP() instead of direct comparisons. It can make the code a lot
smaller, less error prone, and easier to read.
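For illustration, here is what such a conversion tends to look like. The macro definitions are written out as I understand them from the AVL header and should be checked against the tree; the node type and comparator names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way comparison helpers, each yielding -1, 0 or 1. */
#define	TREE_ISIGN(a)	(((a) > 0) - ((a) < 0))
#define	TREE_CMP(a, b)	(((a) > (b)) - ((a) < (b)))
#define	TREE_PCMP(a, b)	\
	(((uintptr_t)(a) > (uintptr_t)(b)) - ((uintptr_t)(a) < (uintptr_t)(b)))

static_assert(TREE_ISIGN(-42) == -1, "sign of a negative value is -1");

/* Hypothetical node sorted by txg, then by offset. */
typedef struct sketch_node {
	uint64_t txg;
	uint64_t offset;
} sketch_node_t;

/* Before: open-coded comparisons, easy to get subtly wrong. */
static int
sketch_compare_old(const void *a, const void *b)
{
	const sketch_node_t *l = a, *r = b;

	if (l->txg < r->txg)
		return (-1);
	if (l->txg > r->txg)
		return (1);
	if (l->offset < r->offset)
		return (-1);
	if (l->offset > r->offset)
		return (1);
	return (0);
}

/* After: the same ordering expressed with TREE_CMP(). */
static int
sketch_compare_new(const void *a, const void *b)
{
	const sketch_node_t *l = a, *r = b;
	int cmp = TREE_CMP(l->txg, r->txg);

	if (cmp != 0)
		return (cmp);
	return (TREE_CMP(l->offset, r->offset));
}
```

TREE_ISIGN() covers the case where a helper such as memcmp() returns an arbitrary negative or positive value that must be normalized, and TREE_PCMP() does the same job for pointers.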
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18259
The spa_sync thread waits on ->spa_txg_zio and will set ZIO_WAIT_DONE
before running the sync tasks. The dmu_tx_commit() call must be made
after we add the child zio to the ->spa_txg_zio parent; otherwise it's
possible for the child to be added after txg_sync has waited.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #18276
The existing zpool properties accounting pool space (size, allocated,
fragmentation, expandsize, free, capacity) are based on the normal
metaslab class or are cumulative properties of several classes combined.
Add properties reporting the space accounting metrics for each metaslab
class individually.
Also introduce pool-wide AVAIL, USABLE, and USED properties reporting
values corresponding to FREE, SIZE, and ALLOC deflated for raidz.
Update ZTS to recognize the new properties and validate reported values.
While in zpool_get_parsable.cfg, add "fragmentation" to the list of
parsable properties.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
Capacity is reported as a percentage not a size.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Closes #18238
Remove redundant dsl_pool variable and duplicate spa_get_dsl()
call in vdev_rebuild_thread.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #18263
vdev_rebuild() is always called with spa_config_lock held in
RW_WRITER mode. However, when it tries to call dmu_tx_assign()
the latter may hang on dmu_tx_wait() waiting for available txg.
But that available txg may not happen because txg_sync takes
spa_config_lock in order to process the current txg. So we have
a deadlock case here:
- dmu_tx_assign() waits for txg holding spa_config_lock;
- txg_sync waits for spa_config_lock not progressing with txg.
Here are the stacks:

  __schedule+0x24e/0x590
  schedule+0x69/0x110
  cv_wait_common+0xf8/0x130 [spl]
  __cv_wait+0x15/0x20 [spl]
  dmu_tx_wait+0x8e/0x1e0 [zfs]
  dmu_tx_assign+0x49/0x80 [zfs]
  vdev_rebuild_initiate+0x39/0xc0 [zfs]
  vdev_rebuild+0x84/0x90 [zfs]
  spa_vdev_attach+0x305/0x680 [zfs]
  zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]

  cv_wait_common+0xf8/0x130 [spl]
  __cv_wait+0x15/0x20 [spl]
  spa_config_enter+0xf9/0x120 [zfs]
  spa_sync+0x6d/0x5b0 [zfs]
  txg_sync_thread+0x266/0x2f0 [zfs]

The solution is to pass txg returned by spa_vdev_enter(spa)
at the top of spa_vdev_attach() to vdev_rebuild() and call
dmu_tx_create_assigned(txg) which doesn't wait for txg.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18210
Closes #18258
This API has been available since kernel 5.2, and having it available
(almost) everywhere should give us a lot more flexibility for mount
management in the future.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18260
Generated from crypto/sha/asm/sha512-x86_64.pl in
openssl/openssl@241d4826f8.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18233
Recent Intel CPUs (starting with Arrow Lake and Lunar Lake) include new
vectorised SHA512 instructions. Detect them and make them available to
the rest of the system.
Note the internal name "sha512ext". This is to disambiguate from other
uses of "sha512".
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes #18233
zfs_range_tree_remove_impl() used a bare panic() when a segment to be
removed was not completely overlapped by an existing tree entry. Every
other consistency check in range_tree.c uses zfs_panic_recover(), which
respects the zfs_recover tunable and allows pools with on-disk
corruption to be imported and recovered. This one call was
inconsistent, making the partial-overlap case unrecoverable regardless
of zfs_recover.
Replace panic() with zfs_panic_recover() so that operators can set
zfs_recover=1 to import a corrupted pool and reclaim data, consistent
with all other range tree error paths.
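The behavioural difference can be sketched as below. The `sketch_*` names are hypothetical stand-ins (the tunable for zfs_recover, the helper for zfs_panic_recover()), and the range check is a simplified model of the partial-overlap test, not the range_tree.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int sketch_zfs_recover = 0;	/* stands in for the zfs_recover tunable */
static int sketch_warnings = 0;		/* counts downgraded panics */

/* Panic unless recovery mode is on, in which case only warn. */
static void
sketch_panic_recover(const char *msg)
{
	if (sketch_zfs_recover) {
		fprintf(stderr, "WARNING: %s\n", msg);
		sketch_warnings++;
	} else {
		fprintf(stderr, "PANIC: %s\n", msg);
		abort();
	}
}

/*
 * Try to remove [start, start + size) from the existing segment
 * [seg_start, seg_end). Returns true if the removal may proceed.
 */
static bool
sketch_remove(uint64_t seg_start, uint64_t seg_end, uint64_t start,
    uint64_t size)
{
	assert(size > 0);
	uint64_t end = start + size;

	if (start < seg_start || end > seg_end) {
		/* Previously a bare panic(): fatal regardless of the tunable. */
		sketch_panic_recover("segment not completely overlapped");
		return (false);
	}
	return (true);
}
```

With the tunable set, the inconsistent removal is reported and skipped so the import can continue; with it clear, the behaviour is the same fatal stop as before.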
Related-to: https://github.com/openzfs/zfs/issues/13483
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #18255
Fedora 41 was deprecated on Dec 15 2025. Remove it from CI tests.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18261
Currently there is only a dedup ratio reported via pool properties.
If dedup is enabled only for some datasets, it is impossible to say
how much space the ratio actually covers. Fix this by introducing
dedupused/dedupsaved pool properties, similar to the block cloning
ones added earlier. Combined with the work to expose allocation class
stats, this should give user-space enough visibility to correlate
`zpool list` and `zfs list` space numbers.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18245
This patch fixes a segmentation fault in zhack metaslab leak which might
be triggered by feeding zhack with a fragmentation profile that's
exported from a pool larger than the target pool.
Fixes: 8f15d2e4d5
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>