Commit Graph

3557 Commits

Author SHA1 Message Date
Ryan Moeller
ac0fd40c8c Add zpool properties for allocation class space
The existing zpool properties accounting pool space (size, allocated,
fragmentation, expandsize, free, capacity) are based on the normal
metaslab class or are cumulative properties of several classes combined.

Add properties reporting the space accounting metrics for each metaslab
class individually.

Also introduce pool-wide AVAIL, USABLE, and USED properties reporting
values corresponding to FREE, SIZE, and ALLOC deflated for raidz.

Update ZTS to recognize the new properties and validate reported values.

While in zpool_get_parsable.cfg, add "fragmentation" to the list of
parsable properties.

Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Cloes #18238
2026-03-02 15:50:23 -08:00
Akash B
f8e5af53e9
Fix redundant declaration of dsl_pool_t
Remove redundant dsl_pool variable and duplicate spa_get_dsl()
call in vdev_rebuild_thread.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #18263
2026-02-27 10:39:52 -08:00
Andriy Tkachuk
f8457fbdc4
Fix deadlock on dmu_tx_assign() from vdev_rebuild()
vdev_rebuild() is always called with spa_config_lock held in
RW_WRITER mode. However, when it tries to call dmu_tx_assign()
the latter may hang on dmu_tx_wait() waiting for available txg.
But that available txg may not happen because txg_sync takes
spa_config_lock in order to process the current txg. So we have
a deadlock case here:

 - dmu_tx_assign() waits for txg holding spa_config_lock;
 - txg_sync waits for spa_config_lock not progressing with txg.

Here are the stacks:

    __schedule+0x24e/0x590
    schedule+0x69/0x110
    cv_wait_common+0xf8/0x130 [spl]
    __cv_wait+0x15/0x20 [spl]
    dmu_tx_wait+0x8e/0x1e0 [zfs]
    dmu_tx_assign+0x49/0x80 [zfs]
    vdev_rebuild_initiate+0x39/0xc0 [zfs]
    vdev_rebuild+0x84/0x90 [zfs]
    spa_vdev_attach+0x305/0x680 [zfs]
    zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]

    cv_wait_common+0xf8/0x130 [spl]
    __cv_wait+0x15/0x20 [spl]
    spa_config_enter+0xf9/0x120 [zfs]
    spa_sync+0x6d/0x5b0 [zfs]
    txg_sync_thread+0x266/0x2f0 [zfs]

The solution is to pass txg returned by spa_vdev_enter(spa)
at the top of spa_vdev_attach() to vdev_rebuild() and call
dmu_tx_create_assigned(txg) which doesn't wait for txg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18210
Closes #18258
2026-02-26 11:18:02 -08:00
clefru
6495dafd58
range_tree: use zfs_panic_recover() for partial-overlap remove
zfs_range_tree_remove_impl() used a bare panic() when a segment to be
removed was not completely overlapped by an existing tree entry.  Every
other consistency check in range_tree.c uses zfs_panic_recover(), which
respects the zfs_recover tunable and allows pools with on-disk
corruption to be imported and recovered.  This one call was
inconsistent, making the partial-overlap case unrecoverable regardless
of zfs_recover.

Replace panic() with zfs_panic_recover() so that operators can set
zfs_recover=1 to import a corrupted pool and reclaim data, consistent
with all other range tree error paths.

Related-to: https://github.com/openzfs/zfs/issues/13483
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #18255
2026-02-25 11:26:10 -08:00
Alexander Motin
991fc56fae
Introduce dedupused/dedupsaved pool properties
Currently there is only a dedup ratio reported via pool properties.
If dedup is enabled only for some datasets, it is impossible to say
how much space the ratio actually covers.  Fix this by introducing
dedupused/dedupsaved pool properties, similar to earlier added
block cloning ones.  Combined with work to expose allocation classes
stats, it should give user-space enough visibility to correlate
`zpool list` and `zfs list` space numbers.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan.moeller@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18245
2026-02-25 09:41:38 -05:00
MigeljanImeri
4975430cf5
Add vdev property to disable vdev scheduler
Added vdev property to disable the vdev scheduler.
The intention behind this property is to improve IOPS
performance when using o_direct.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes #17358
2026-02-23 09:34:33 -08:00
Tony Hutter
d2f5cb3a50
Move range_tree, btree, highbit64 to common code
Break out the range_tree, btree, and highbit64/lowbit64 code from kernel
space into shared kernel and userspace code.  This is needed for the
updated `zpool status -vv` error byte range reporting that will be
coming in a future commit.  That commit needs the range_tree code in
kernel and userspace.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18133
2026-02-22 11:43:51 -08:00
Alexander Motin
d06a1d9ac3
Fix available space accounting for special/dedup (#18222)
Currently, spa_dspace (base to calculate dataset AVAIL) only includes
the normal allocation class capacity, but dd_used_bytes tracks space
allocated across all classes.  Since we don't want to report free
space of other classes as available (we can't promise new allocations
will be able to use it), report only allocated space, similar to how
we report space saved by dedup and block cloning.

Since we need deflated space here, make allocation classes track
deflated allocated space also.  While here, make mc_deferred also
deflated, matching its use contexts.  Also while there, use
atomic_load() to read the allocation class stats.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18190
Closes #18222
2026-02-19 10:36:35 -08:00
Tony Hutter
640a217faf
CI: Test & fix Linux ZFS built-in build
ZFS can be built directly into the Linux kernel.  Add a test build
of this to the CI to verify it works.  The test build is only enabled
on Fedora runners (since they run the newest kernels) and is done in
parallel with ZTS.  The test build is done on vm2, since it typically
finishes ~15min before vm1 and thus has time to spare.

In addition:

- Update 'copy-builtin' to check that $1 is a directory
- Fix some VERIFYs that were causing the built-in build to fail

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #18234
2026-02-19 10:15:41 -08:00
Alexander Motin
370570890f
Remove parent ZIO from dbuf_prefetch()
I am not sure why it was added there 10 years ago, but it seems not
needed now.  According to my tests removing it improves sequential
read performance with recordsize=4K by 5-10% by reducing the CPU
overhead in prefetcher.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18214
2026-02-18 18:12:13 -08:00
Alexander Motin
0f9564e85b
Simplify dnode_level_is_l2cacheable()
We should not dereference through dn_handle->dnh_dnode once we
already have a dnode pointer.  The result will be the same.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18212
2026-02-16 10:34:22 -05:00
Alexander Motin
ba970eb202
Cleanup allocation class selection
- For multilevel gang blocks it seemed possible to fallback from
normal to special class, since they don't have proper object type,
and DMU_OT_NONE is a "metadata".  They should never fallback.
 - Fix possible inversion with zfs_user_indirect_is_special = 0,
when indirects written to normal vdev, while small data to special.
Make small indirect blocks also follow special_small_blocks there.
 - With special_small_blocks now applying to both files and ZVOLs,
make it apply to all non-metadata without extra checks, since there
are no other non-metadata types.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18208
2026-02-16 10:33:21 -05:00
Mariusz Zaborski
cdf89f413c
Flush RRD only when TXGs contain data
This change modifies the behavior of spa_sync_time_logger when
flushing the RRD database.

Previously, once the sync interval elapsed, a flush would always
be generated. On solid-state devices, especially when the pool was
otherwise idle, this caused disks to wake up solely to write RRD
data. Since RRD is best-effort telemetry, this behavior is
unnecessary and wasteful.

With this change, spa_sync_time_logger delays flushing until a TXG
that already contains data is being synced. The RRD update is
appended to that TXG instead of forcing the creation of
a new write-only TXG.

During pool export, flushing is forced regardless of whether
the TXG contains user data. At that stage, data durability takes
precedence and a write must be issued.

Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #18082
Closes #18138
2026-02-11 11:35:45 -08:00
Marc Sladek
cc184fe98b
Fix send:raw permission for send -w -I
When performing an incremental raw send with intermediates (-w -I),
the standard 'send' permission was incorrectly required instead of
allowing 'send:raw'. This was due to a strict boolean comparison on
the 'rawok' flag in zfs_secpolicy_send() with non-boolean value.

This change normalizes the 'rawok' variable to be strictly 0/1 and
updates the test suite to properly verify delegated raw send behavior.

Introduced-by: https://github.com/openzfs/zfs/pull/17543
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Marc Sladek <marc@sladek.dev>
Closes #18198
Closes #18193
2026-02-11 10:30:26 -08:00
Alexander Motin
aa29455dd7
Restrict cloning with different properties
While technically its not a problem to clone between datasets with
different properties, it might create expectation of new properties
being applied during data move, while actually it won't happen.
For copies and checksum it may mean incorrect safety expectations.
For dedup, compression and special_small_blocks -- performance and
space usage. New zfs_bclone_strict_properties tunable controls it.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18180
2026-02-10 09:53:24 -08:00
Alexander Motin
2646bd5585
Allow rewrite skip cloned and snapshotted blocks
Rewrite of cloned and snapshotted blocks can allocate additional
space, that may be undesired.  In some cases it may have sense
to still rewrite snapshotted blocks, expecting the snapshots to
rotate with time, freeing space.  In other cases rewrite of cloned
blocks may be acceptable, despite persistent space usage increase.
For this reason add them as separate flags to `zfs rewrite`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18179
2026-02-09 10:17:56 -08:00
Brian Behlendorf
ae488e496f ZTS: update the relevant mmp test cases
- mmp_concurrent_import: added test case to verify that concurrent
  import correctness.  The pool may only be imported once.

- mmp_exported_import: an activity check is now required for pools
  which were cleanly exported if the system and pool hostids don't
  match.

- mmp_inactive_import: an activity check is now required for any
  pool which wasn't cleanly exported, even if the system and pool
  hostids match.

- mmp_on_uberblocks: updated expected uberblocks to take in to account
  the value MMP_INTERVAL_DEFAULT is set too.

- mmp_reset_interval: reduce the number of iterations from 10 to 3.
  This is sufficient to verify functionality and significantly speeds
  up the test.

- mmp_on_uberblocks: adjust the thresholds and increase the runtime
  to avoid false positives observed in CI.

- Update tests to use 'zhack action idle' instead of ztest to improve
  the reliability of the tests.

- Add additional log_note messages to test cases which have multiple
  verification steps to make it clear which portion of a test failed
  when reviewing the logs.

- Replace default_setup/cleanup_noexit calls with 'zpool create' and
  'zpool destroy' calls to avoid additional unnecessary dataset
  creation work.

- Update activity/noactivity check helper functions to use the
  ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
  to determine if this activity check ran and why.  This is more
  reliable in the CI than measuring the runtime.

- Removed all mmp tests from the zts-report.py exceptions list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:36:18 -08:00
Brian Behlendorf
20176224ee mmp: claim sequence id before final import
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts.  This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect.  This extra
safety chceck operates as follows:

1. Repeats the following MMP check 10 times:
  a. Write out an MMP uberblock with the best txg and a random
     sequence id to all primary pool vdevs.
  b. Verify a minimum number of good writes such that even if
     the pool appears degraded on the remote host it will see
     at least one of the updated MMP uberblocks.
  c. Wait for the MMP interval this leaves a window for other
     racing hosts to make similar modifications which can be
     detected.
  d. Call vdev_uberblock_load() to determine the best uberblock
     to use, this should be the MMP uberblock just written.
  e. Verify the txg and random sequeunce number match the MMP
     uberblock written in 1a.

2. Restore the original MMP uberblocks.  This allows the check
   to be performed again if the pool fails to import for an
   unrelated reason.

This change also includes some refactoring and minor improvements.

- Never try loading earlier txgs during import when the import
  fails with EREMOTEIO or EINTER.  These errors don't indicate
  the txg is damaged but instead that its either in use on a
  remote host or the import was interactively cancelled.  No
  rewind is also performed for EBADD which can result from a
  stale trusted config when doing a verbatim import.

- Refactor the code for consistent logging of the multihost
  activity check using spa_load_note() and console messages
  indicating when the activity check was trigger and the result.

- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
  modification of the sequence number in an uberblock.

- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
  set to log to dump to stdout the spa_load_info nvlist returned
  during import.  This is used by the updated mmp test cases
  to determine if an activity check was run and its result.

- Standardize the mmp messages similarly to make it easier to
  find all the relevent mmp lines in the debug log.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:36:01 -08:00
Brian Behlendorf
2f048ced4d mmp: add spa_load_name() for tryimport
Tryimport adds a unique prefix to the pool name to avoid name
collisions.  This makes it awkward to log user-friendly info
during a tryimport.  Add a spa_load_name() function which can
be used to report the unmodified pool name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:35:03 -08:00
Brian Behlendorf
62a1bf7d19 mmp: move "Starting import" log message
Move the "Starting import" log message in to the import block so
it's matched with the "Fiinshed importing" debug message.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:34:57 -08:00
Brian Behlendorf
a9564b1787 mmp: further restrict mmp exported pool check
For a cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check.  Only allow the check to be skipped when the
last imported hostid matches the systems hostid and the pool was
cleanly exported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
2026-02-09 09:32:58 -08:00
Austin Wise
4f180e095a
Fix activating large_microzap on receive
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes #18143
Closes #18144
2026-02-05 15:48:03 -08:00
Alexander Motin
21bbe7cb67
Improve caching for dbuf prefetches
To avoid read errors with transaction open dmu_tx_check_ioerr()
is used to read everything required in advance.  But there seems
to be a chance for the buffer to evicted from dbuf cache in
between, which result in immediate eviction from ARC, which may
require additional disk read later in a place where error handling
is problematic.

To partially workaround this introduce a new flag DMU_IS_PREFETCH,
relayed to ARC as ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH,
making ARC delay eviction by at least several seconds, or till the
actual read inside the transaction, that will promote it to demand
access.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18160
2026-02-04 10:12:32 -08:00
Ameer Hamza
00d69b0f72 arc: remove unused l2df_size and l2df_type from l2arc_data_free_t
These fields became unused when ABD was introduced in a6255b7fc.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:07:26 -08:00
Ameer Hamza
d1f290f1ea L2ARC: Implement DWPD-based rate limiting with adaptive feed intervals
Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write
speeds and protect SSD endurance. Write rate is constrained by the
minimum of l2arc_write_max and DWPD-calculated budget. Devices
accumulate unused write budget over 24-hour periods with automatic reset
and carry-over. Writes occur in controlled bursts (max 50MB) with
adaptive intervals to achieve target rates. Applies after initial device
fill.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:07:07 -08:00
Ameer Hamza
b525525b44 L2ARC: Implement per-device feed threads for parallel writes
Transform L2ARC from single global feed thread to per-device threads,
enabling parallel writes to multiple L2ARC devices. Each device runs
its own feed thread independently, improving multi-device throughput.
Previously, a single thread served all devices sequentially; now each
device writes concurrently. Threads are created during device addition
and torn down on removal.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:07:02 -08:00
Ameer Hamza
825dc41ad4 L2ARC: Preserve L2HDR in arc_release() for in-flight writes
When arc_release() is called on a header with a single buffer and
L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar
to the arc_hdr_destroy() case). If we destroy the L2HDR here, later
arc_write() will allocate a new ABD and call arc_hdr_free_abd(),
which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing
VERIFY(HDR_HAS_L2HDR(hdr)) to fail.

Allocate a new header for the buffer in the single_buf_l2writing
case (single buffer + L2_WRITING), leaving the original header with
L2HDR intact. The original header becomes an "orphan" (no buffers, no
b_pabd) but retains device association for ABD cleanup when
l2arc_write_done() completes.

The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC
makes its own transformed copy via l2arc_apply_transforms(), so the
original ABD is not used by the L2 write. The header can be safely
reused without allocating a new one.

For proper evictable space accounting, arc_buf_remove() must be
called before remove_reference() in the single_buf_l2writing path.
This ensures arc_evictable_space_increment() (during remove_reference)
and arc_evictable_space_decrement() (during destruction) see the
same state (b_buf=NULL), preventing accounting leaks that cause
module unload to hang with non-zero esize.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:06:57 -08:00
Ameer Hamza
b8610c3d93 L2ARC: Reorder header destruction for in-flight L2 writes
With multiple L2ARC devices, headers can be destroyed asynchronously
(e.g., during zpool sync) while L2_WRITING is set. The original code
destroyed L2HDR before L1HDR, causing ABDs to lose their device
association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called.

This caused ABDs to be added to the global free-on-write list without
device information. When any L2ARC device completed its write and
attempted to free these orphaned ABDs, it would panic on
ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was
still part of another device's vdev_queue I/O aggregation gang.

Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and
reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This
ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag
ABDs with their device for deferred cleanup.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:06:51 -08:00
Ameer Hamza
2f41b9d865 L2ARC: Implement persistent markers with consistent tail scanning
This commit introduces per-sublist persistent markers that eliminate
redundant tail scanning between L2ARC iterations, providing significant
CPU efficiency improvements. Markers are pre-allocated during device
initialization and properly cleaned up during device removal.

The implementation uses conditional behavior based on device capacity:
small devices (capacity < arc_c) retain original HEAD/TAIL scanning
based on ARC warmup state, while large devices (capacity >= arc_c)
use the persistent marker approach for optimal CPU efficiency.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:06:47 -08:00
Ameer Hamza
3523b5f3f9 L2ARC: Implement even-depth multi-sublist scanning
The introduction of ARC multilists made L2ARC writing quite random,
depending on whether it found something to write in a randomly selected
sublist. This created inconsistent write patterns and poor utilization
of available sublists leading to uneven cache population.

This commit replaces random selection with systematic scanning across
all sublists within each burst. Fair headroom distribution ensures
even-depth traversal across all sublists until the target write size
is reached. Round-robin processing with random starting points eliminates
sequential bias while maintaining predictable write behavior.

The systematic approach provides consistent L2ARC filling patterns
and better utilization of available ARC data across all sublists.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #18093
2026-02-04 10:05:53 -08:00
Mariusz Zaborski
a157ef62a1
Make sure we can still write data to txg
The final txgs are used only to clear out any remaining deferred
frees, and we cannot write new data to them. Make sure we do not
try to do so.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #18139
2026-01-26 21:33:21 -05:00
Alexander Motin
35b2d39709
Lock db_mtx around arc_release() in couple places
* Lock db_mtx around arc_release() in dbuf_release_bp()

While this function is called only in sync context, the same buffer
can be touched by dbuf_hold_impl() in open context, creating races.
All other accesses to arc_release() are already protected by db_mtx,
so just take it here too.

Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>

* Lock db_mtx in sa_byteswap()

While SA code seems protected by sa_lock, there is a back door of
dmu_objset_userquota_get_ids(), that may hold and access the dbuf
without sa_lock, relying only on db_mtx. Taking db_mtx here should
protect both the arc_release() and the data for db_buf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18146
2026-01-26 21:32:16 -05:00
Martin Matuška
8605bdfdda
FreeBSD: unbreak compilation on i386
tests/zfs-tests/cmd/mmap_seek.c: use correct printf specifier
module/zfs/vdev.c: vdev_clear(): correctly cast argument to
atomic_add_64().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #18096
2026-01-14 17:02:41 -08:00
shuppy
09e4e01e93
Fix history logging for zpool create -t
`zpool create` is supposed to log the command to the new pool’s history,
as a special record that never gets evicted from the ring buffer. but
when you create a pool with `zpool create -t`, no such record is ever
logged (#18102). that bug may be the cause of issues like #16408.

`zpool create -t` (83e9986f6e) and `zpool
import -t` (26b42f3f9d) are both designed
to override the on-disk zpool property `name` with an in-core
“temporary” name, but they work somewhat differently under the hood.

importing with a temporary name sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells
spa_write_cachefile() and spa_config_generate() to use the
ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`.

creating with a temporary name permanently(!) sets the internal zpool
property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of
ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create()
(4ceb8dd6fd) and spa_create() to use that
name instead of `zc->zc_name`, then sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` like an import.

but zfsdev_ioctl_common() fails to check for `tname` when saving the
pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY,
we call spa_open() on the wrong pool name and get ENOENT, so the logging
silently fails.

this patch fixes #18102 by checking for `tname` in zfsdev_ioctl_common()
like we do in zfs_ioc_pool_create().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes #18118  
Closes #18102
2026-01-14 14:51:51 -08:00
Alexander Motin
765929cb4e
DDT: Add locking for table ZAP destruction
Similar to BRT, DDT ZAP can be destroyed by sync context when it
becomes empty.  Respectively similar to BRT introduce RW-lock to
protect open context methods from the destruction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18115
2026-01-13 15:07:15 -08:00
Austin Wise
794f1587db
When receiving a stream with the large block flag, activate feature
ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS
to indicate the presence of large blocks in the dataset. On the sending
side, this flag is included if the `-L` flag is passed to `zfs send`
and the feature is active in the dataset. On the receive side, the
stream is refused if the feature is active in the destination dataset
but the stream does not include the feature flag.

The problem is the feature is only activated when a large block is
born. If a large block has been born in the destination, but never
the source, the send can't work. This can arise when sending streams
back and forth between two datasets.

This commit fixes the problem by always activating the large blocks
feature when receiving a stream with the large block feature flag.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes #18105
2026-01-07 16:47:12 -08:00
Wolfgang Hoschek
c77f17b750
Add snapshots_changed_nsecs dataset property
Add a read-only dataset property, snapshots_changed_nsecs, which 
exposes the nanosecond resolution version of snapshots_changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Wolfgang Hoschek <wolfgang.hoschek@mac.com>
Closes #17998
Closes #18031
2026-01-06 09:36:20 -08:00
Rob Norris
654e7628d6 u8_textprep: move into module/zfs
Now that it's built into the main zfs module in all cases, there's no
reason to put it in its own dir.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #18071
2025-12-22 14:58:36 -08:00
Alexander Motin
962e68865e
Use reduced precision for scan times
Scan time limits do not need precision beyond 1ms.  Switching
scn_sync_start_time and spa_sync_starttime from gethrtime() to
getlrtime() saves ~3% of CPU time during resilver scan stage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18061
2025-12-18 10:22:11 -08:00
Alexander Motin
a83bb15fcd
Reduce minimal scrub/resilver times
With higher throughput and lower latency of modern devices ZFS can
happily live with pretty short (fractions of a second) TXGs.  But
the two decade old multi-second minimal time limits can almost stop
payload writes by extending TXGs beyond dirty data limits of ARC
ability to amortize it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18060
2025-12-18 10:21:45 -08:00
Alexander Motin
051a8c7494
Bypass snprintf() in quota checks if no quotas set
This improves synthetic 1 byte write speed by ~2.5%.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18063
2025-12-17 21:59:47 -05:00
Alexander Motin
0550abd4b8
RAIDZ: Remove some excessive logging
There were some per I/O logging into dbgmsg in RAIDZ code, that
increased CPU load and wiped useful content out of dbgmsg, for
example during routine disk replacement process.  I don't think
we need it to be that verbose.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18059
2025-12-17 14:00:01 -08:00
Alexander Motin
22e89aca88
DDT: Fix compressed entry buffer size
The first byte of the entry after compression is used for algorithm
and byte order flag.  We should decrement when calling compression/
decompression algorithm.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18055
2025-12-15 14:52:44 -08:00
Alexander Motin
3b1ff816bd
DDT: Add/use zap_lookup_length_uint64_by_dnode()
Unlike other ZAP consumers due to compression DDT does not know
how big entry it is reading from ZAP.  Due to this it called
zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(),
each of which does full ZAP entry lookup.

Introduction of the combined ZAP method dramatically reduces the
CPU overhead and locks contention at DBUF layer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18048
2025-12-15 14:38:34 -08:00
Alexander Motin
ff5414406f
DDT: Switch to using ZAP _by_dnode() interfaces
As was previously done for BRT, avoid holding/releasing DDT ZAP
dnodes for every access.  Instead hold the dnodes during all their
life time, never releasing.

While at this, add _by_dnode() interfaces for zap_length_uint64()
and zap_count(), actively used by DDT code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18047
2025-12-15 09:49:14 -08:00
Alexander Motin
46d6f1fe56
DDT: Move logs searches out of the lock
Postponing entry removal from the DDT log in case of hit till later
single-threaded sync stage allows to make ddl_tree stable during
multi-threaded ZIO processing stage.  It allows to drop the DDT lock
before the search instead of after, reducing the contention a lot.

Actually ddt_log_update_entry() was already handling the case of
entry present in the active log, so we only need to remove it from
flushing log, if the entry happen to be there.

My tests with parallel 4KB block writes show throughput increase
from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even
though still limited by the global DDT lock contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18044
2025-12-15 09:17:04 -08:00
Alexander Motin
3d76ba2737
Improve async destroy processing timing
Previous code effectively enforced that all async free ZIOs were
_issued_ within the TXG timeout.  But they could take forever to
complete, especially if the required metadata were not in ARC.

This patch introduces periodic waits every 2000 ZIOs, which should
give at least somewhat reasonable TXG timings even for single HDD
pools with empty ARC.  And makes them complete within half of the
TXG timeout, since we might still need time to sync DDT and BRT.

While there, change zfs_max_async_dedup_frees semantics to include
also clone and gang blocks, which are similar.  Bump the default
value from set long ago to be more forgiving to block cloning
(still not having logs and benefiting from large TXGs), now that
we have better working time limits.  The limit now is a possible
amount of dirty data produced by BRT updates.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18043
2025-12-11 18:46:08 -08:00
Alexander Motin
f72fd378c8 Defer async destroys on pool import
We've observed a number of cases when pool import stuck for many
minutes due to large async destroy trying to load DDT or BRT from
HDD pool.  While proper destroy dosage is a separate problem,
lets give import process a chance to complete before that at all.
It may be not enough if there is a lot of ZIL to replay, but that
is harder to cover, since those are in separate syscalls.

Code investigation shown that we already have this mechanism used
for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS
into a tunable and applies it to async destroys also.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18033
2025-12-11 18:44:46 -08:00
Alexander Motin
d393166c54
ARC: Increase parallel eviction batching
Before parallel eviction implementation zfs_arc_evict_batch_limit
caused loop exits after evicting 10 headers.  The cost of it is not
big and well motivated.  Now though taskq task exit after the same
10 headers is much more expensive.  To cover the context switch
overhead of taskq introduce another level of batching, controlled
by zfs_arc_evict_batches_limit tunable, used only for parallel
eviction.

My tests including 36 parallel reads with 4KB recordsize that shown
1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention,
now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17970
2025-12-10 13:03:01 -08:00
Chunwei Chen
0c194352b5
Fix ddtprune causing space leak
In zio_ddt_free, if a pruned dde is still in ddt, it would do nothing
and cause space leak.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17982
Closes #17983
2025-12-10 10:02:14 -08:00