Commit Graph

3557 Commits

Author SHA1 Message Date
Alexander Motin
a62c62120e
ARC: Pre-convert zfs_arc_min_prefetch_ms
There is no need to do MSEC_TO_TICK() for each evicted ARC header.
We can do it when tunables are set, since we already have separate
internal variables for those.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17965
2025-12-09 12:07:10 -08:00
Alexander Motin
09492e0f21
Reduce dataset buffers re-dirtying
For each block written or freed ZFS dirties ds_dbuf of the dataset.
While dbuf_dirty() has a fast path for already dirty dbufs, it still
require taking the lock and doing some things visible in profiler.

Investigation shown ds_dbuf dirtying by dsl_dataset_block_born()
and some of dsl_dataset_block_kill() are just not needed, since
by the time they are called in sync context the ds_dbuf is already
dirtied by dsl_dataset_sync().

Tests show this reducing large file deletion time by ~3% by saving
CPU time of single-threaded part of the sync thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #18028
2025-12-09 09:18:09 -08:00
bspengler-oss
2cab0554c0 Preserve LIFO ordering of kmap ops in abd_raidz_gen_iterate()
ZFS typically preserves proper LIFO ordering regarding map/unmap
operations that wrap the Linux kernel's kmap interfaces that
require such ordering, but one instance in abd_raidz_gen_iterate()
did not.

Similar issues have been fixed in the Linux kernel in the past,
see for instance CVE-2025-39899 for userfaultfd.

Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes #15668
Closes #18030
2025-12-09 09:12:16 -08:00
Ameer Hamza
88d012a1d6
Fix snapshot automount expiry cancellation deadlock
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.

Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.

The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17941
2025-12-01 14:43:42 -08:00
Alexander Motin
928eccc5bc
DDT: Reduce global DDT lock scope during writes
Before this change DDT lock was taken 4 times per written block,
and as effectively a pool-wide lock it can be highly congested.
This change introduces a new per-entry dde_io_lock, protecting some
fields during I/O ready and done stages, so that we don't need the
global lock there.

According to my write tests on 64-thread system with 4KB blocks this
significantly reduce the global lock contention, reducing CPU usage
from 100% to expected ~80%, and increasing write throughput by 10%.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17960
2025-12-01 10:44:10 -08:00
Alexander Motin
a5b665df39
DDT: Switch to using wmsums for lookup stats
ddt_lookup() is a very busy code under a highly congested global
lock.  Anything we can save here is very important.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17980
2025-12-01 10:36:31 -08:00
Alexander Motin
48f33c1ef2
DDT: Make children writes inherit allocator
Even though unlike gang children it is not so critical for dedup
children to inherit parent's allocator, there is still no reason
for them to have allocation policy different from normal writes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17961
2025-12-01 10:30:27 -08:00
Alexx Saver
39303febac
chksum: run 256K benchmark on demand, preserve chksum_stat_data
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexx Saver <lzsaver.eth@ethermail.io>
Co-authored-by: Adam Moss <c@yotes.com>
Closes #17945
Closes #17946
2025-12-01 10:14:52 -08:00
Rob Norris
faa295b9a6 libspl: move SID definitions from zfs_context.h; remove kernel gate
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17861
2025-11-12 10:01:48 -08:00
Mariusz Zaborski
02fdd26e51
Add knob to disable slow io notifications
Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that
allows users to disable notifications for slow devices.
This prevents ZED and/or ZFSD from degrading the pool due to slow
I/O.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes 17477
2025-11-11 10:42:17 -08:00
Alexander Motin
b4f073b5a6
Add BRT support to zpool prefetch command
Implement BRT (Block Reference Table) prefetch functionality similar
to existing DDT prefetch.  This allows preloading BRT metadata into
ARC to improve performance for block cloning operations and frees
of earlier cloned blocks.

Make -t parameter optional.  When omitted, prefetch all supported
metadata types (both DDT and BRT now).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17890
2025-11-10 16:16:22 -08:00
Alexander Motin
cc5cae5475
BRT: Increase block size from 4KB to 8KB
According to my observations, BRT ZAPs are typically compressible
3:1 for data and 2:1 for indirects.  With ashift=12, typical these
days, it means increasing the block sizes to 8KB we may get most
of possible compression, reducing on-disk and in-ARC BRT footprint
in half by the cost of some compression/decompression overhead,
but without real write inflation, only some dirty data increase.

Increase to 32KB similar to DDT could further increase compression
and storage efficiency, but at the cost of write inflation and
much bigger dirty data increase, which we can not properly control
now.  So lets leave this for a time when BRT log gets implemented.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17916
2025-11-10 15:44:46 -08:00
Alexander Motin
72b2a9571a
ZAP: Remove dmu_object_info_from_dnode() call
dmu_object_info_from_dnode() takes two locks and copies plenty of
data that we don't need in zap_lockdir_impl().  Just read dn_type
directly in this hot path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17921
2025-11-10 14:26:15 -08:00
Rob Norris
6e12f0bd77
spa_misc: add an API for spa_namespace_lock
This is useful as debugging support, as it lets namespace lock
operations be traced directly. It will also be useful for future work to
reduce the use of spa_namespace_lock, traditionally a source of
difficult deadlocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17906
2025-11-10 14:23:39 -08:00
Alexander Motin
baefe098ee
ZIO: Set minimum number of free issue threads to 32
Free issue threads might block waiting for synchronous DDT, BRT or
GANG header reads. So unlike other taskqs using ZTI_SCALE to scale
with number of CPUs, here we also need some amount of threads to
potentially saturate pool reads.  I am not sure we always want the
96 threads we had before ZTI_SCALE introduction at #11966 on small
systems, but lets make it at least 32.

While here, make free taskqs configurable, similar to read and
write ones.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17903
2025-11-08 14:41:53 -05:00
Adi-Goll
54876ee85e
Fix typo in vdev_raidz.c
Change the spelling of "begining" on line 4875 to
"beginning".

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com>
Closes #17905
2025-11-07 09:55:03 -08:00
Paul Dagnelie
8c225ff1b4
Fix gang write late_arrival bug
When a write comes in via dmu_sync_late_arrival, its txg is equal to the
open TXG. If that write gangs, and we have not yet activated the new
gang header feature, and the gang header we pick can store a larger gang
header, we will try to schedule the upgrade for the open TXG + 1. In
debug mode, this causes an assertion to trip. This PR sets the TXG for
activating the feature to be the larger of either the current open TXG
or the syncing TXG + 1.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17824
2025-11-05 11:40:22 -08:00
Robert Evans
d0294aa758
Update dnode_next_offset_level to accept blkid instead of offset
Currently this function uses L0 offsets which:
1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index).
This way the blkid manipulations are clear, and it's also clear that
the traversal always terminates since blkid goes one direction.

I've also considered updating dnode_next_offset to operate on blkid.
Callers use both patterns, so maybe another PR can split the cases?

While here tidy up dnode_next_offset_level comments.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #17792
2025-11-04 13:12:17 -08:00
Alexander Motin
6cfc3dba9c
Cleanup ZIO_FLAG_IO_RETRY vs TRYHARD usage
In cases where all issued ZIOs must succeed, and we can't do
anything clever about the errors, we should just explicitly set
ZIO_FLAG_TRYHARD and let OS to do all the reasonable retries.

In other cases, where retries can be different from the original,
for example, some ZIOs are allowed to fail due to redundancy, or
we can disable aggregation on retrial to get at least some of
the data, we can do first pass without TRYHARD, and only if needed
retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics).

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17877
2025-10-30 16:29:48 -07:00
Alexander Motin
ec268cdf97 Fix caching of DDT log and BRT
Both DDT log and BRT counters we read on pool import and then only
append or overwrite in full blocks.  We don't need them in DMU or
ARC caches.  Fortunately we have DMU_UNCACHEDIO for this now.

Even more we don't need BRT in non-evictable metadata DMU caches,
since it will likely never fit there, while block the cache from
its original users.  Since DMU_OT_IS_METADATA_CACHED() has no way
to differentiate the new metadata types, mark BRT with storage
type of DMU_OT_DDT_ZAP.  As side effect it will also put it on
dedup device, but that should actually be right.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17875
2025-10-30 16:28:28 -07:00
Alexander Motin
ea125eeb5d BRT: Round bv_entcount up to BRT_BLOCKSIZE
Since we set bv_mos_brtvdev block size, and since we keep dirty
bitmap at the same granularity, we should keep the allocations
and writes done with.  Otherwise it makes the last block write
short, that will be odd once we implement writing of only dirty
blocks, but also requires read-modify-write on DMU layer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17875
2025-10-30 16:28:05 -07:00
Alexander Motin
dcada084b9
Pass flags to more DMU write/hold functions
Over the time many of DMU functions got flags argument to control
prefetch, caching, etc.  Few functions though left without it, even
though closer look shown that many of them do not require prefetch
due to their access pattern.  This patch adds the flags argument to
dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(),
passing DMU_READ_NO_PREFETCH where applicable.

I am going to also pass DMU_UNCACHEDIO to some of them later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17872
2025-10-29 11:17:51 -07:00
Andrew Walker
adacf020ce
Fix return value for setting zvol threading
We must return -1 instead of ENOENT if the special zvol threading
property set function can't locate the dataset (this would typically
happen with an encypted and unmounted zvol) so that the operation
gets inserted properly into the nvlist for operations to set. This
is because we want the property to be set once the zvol is
decrypted again.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #17836
2025-10-20 15:21:40 -07:00
Brian Behlendorf
5a03e358fc
Update device removal documentation
Make a minor update to the 'zpool remove' man page to clarify both
raidz and draid pools do not support removal, and change sector to
ashift which is what we actually care about.

Update the big theory comment in vdev_removal.c to accurately reflect
which types of vdevs can be removed.  Furthermore, I've added some
discussion for the casual reader to briefly explain the top-level
vdev removal restrictions.  This has been a common area of confusion
and it's not intuitive where they come from without understanding
the implementation details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17847
2025-10-20 09:26:51 -04:00
Shreshth3
a5af3f2db7
arc: fix small typos
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com>
Closes #17840
2025-10-13 11:23:55 -07:00
Mark Johnston
a9f2a1f361
Fix the type of the raidz_outlier_check_interval_ms parameter
It's an hrtime_t, which is an unsigned long long.  In practice this is
just a U64.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #17833
2025-10-13 10:47:09 -07:00
Mateusz Guzik
346ecac61b
Annotate arc_buf_is_shared as __maybe_unused
Otherwise the compiler warns about it on production FreeBSD builds.

The routine proved resilient to attempts to ifdef on debug.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #17818
2025-10-06 16:43:20 -07:00
Igor Ostapenko
cb3c18a9a9 ddt prune: Add SCL_ZIO deadlock workaround
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17793
2025-10-01 15:17:09 -07:00
Igor Ostapenko
e829e2fd04 spa_config: Rename spa_config_enter_mmp() to spa_config_enter_priority()
Originally this was created for MMP, but now new cases are emerging
where the same mechanism is required. Hence the name's generalization.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17793
2025-10-01 15:16:04 -07:00
Robert Evans
8869caae5f
zinject: Introduce ready delay fault injection
This adds a pause to the ZIO pipeline in the ready stage for
matching I/O (data, dnode, or raw bookmark).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #17787
2025-10-01 12:17:13 -07:00
hoshinomori
e4a407f29f
range_tree: drop duplicate zfs_ prefix from rs_set_fill_raw
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: hoshinomori <hoshinomori@owarisekai.moe>
Closes #17800
2025-09-29 16:38:52 -07:00
Robert Evans
26b0f561be
dnode_next_offset: backtrack if lower level does not match
This changes the basic search algorithm from a single search up and down
the tree to a full depth-first traversal to handle conditions where the
tree matches at a higher level but not a lower level.

Normally higher level blocks always point to matching blocks, but there
are cases where this does not happen:

1. Racing block pointer updates from dbuf_write_ready.

   Before f664f1ee7f (#8946), both dbuf_write_ready and
   dnode_next_offset held dn_struct_rwlock which protected against
   pointer writes from concurrent syncs.

   This no longer applies, so sync context can f.e. clear or fill all
   L1->L0 BPs before the L2->L1 BP and higher BP's are updated.

   dnode_free_range in particular can reach this case and skip over L1
   blocks that need to be dirtied. Later, sync will panic in
   free_children when trying to clear a non-dirty indirect block.

   This case was found with ztest.

2. txg > 0, non-hole case. This is #11196.

   Freeing blocks/dnodes breaks the assumption that a match at a higher
   level implies a match at a lower level when filtering txg > 0.

   Whenever some but not all L0 blocks are freed, the parent L1 block is
   rewritten. Its updated L2->L1 BP reflects a newer birth txg.

   Later when searching by txg, if the L1 block matches since the txg is
   newer, it is possible that none of the remaining L1->L0 BPs match if
   none have been updated.

   The same behavior is possible with dnode search at L0.

   This is reachable from dsl_destroy_head for synchronous freeing.
   When this happens open context fails to free objects leaving sync
   context stuck freeing potentially many objects.

   This is also reachable from traverse_pool for extreme rewind where it
   is theoretically possible that datasets not dirtied after txg are
   skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely
as ESRCH result implies no more matches in the entire object.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes #16025
Closes #11196
2025-09-25 11:06:28 -07:00
Brian Behlendorf
c722bf8812
Add interface to interface spa_get_worst_case_min_alloc() function
Provide an interface to retrieve the lowest and highest minimum
allocation size for the normal allocation class.  This can be used
by external consumers of the DMU to estimate potential wasted
capacity when setting the recordsize for an object.

The new "min_alloc" and "max_alloc" keys are added to the pool
configuration and used by default_volblocksize() to warn when
an ineffecient block size is requested.  For older kmods which
don't yet include the new keys fallback to the previous logic.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17758
2025-09-25 09:35:35 -07:00
Alexander Motin
3f4312a0a4
Fix two infinite loops if dmu_prefetch_max set to zero
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17692
Closes #17729
2025-09-13 12:58:48 -04:00
Paul Dagnelie
9b772f328b
Fix time database update calculations
The time database update math assumed that the timestamps were in
nanoseconds, but at some point in the development or review process they
changed to seconds. This PR fixes the math to use seconds instead.
    
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17735
2025-09-12 16:33:36 -07:00
Alexander Motin
bc8bcfc71a
Fix type in dbrrd_closest()
For ABS() to work, the argument must be signed, but rrdd_time is
uint64_t.  Clang noticed it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Fixes #16853
Closes #17733
2025-09-12 11:05:38 -07:00
Chunwei Chen
37cd30f714
Fix ddle memleak in ddt_log_load
In ddt_log_load(), when removing dup entry from flushing tree, it doesn't
free the entry causing memleak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17657
Closes #17730
2025-09-12 10:05:06 -07:00
Allan Jude
7b1cc9eb61 ZFS allow send:encrypted
A new `zfs allow` permissions that ONLY allows sending replication
streams in raw (encrypted) mode, so encrypted data will not be
decrypted as part of the replication process.

Sponsored-by: Klara, Inc.
Sponsored-by: Karakun AG
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Co-authored-by: JT Pennington <jt.pennington@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #17543
2025-09-12 09:53:31 -07:00
Brian Behlendorf
bc0b5318aa
Prevent scrubbing a read-only pool
While it would be nice to be able to scrub a pool imported read-only
this will currently trip an ASSERT.  Before we can support this there
are some designs challenges which need to be thought through first.

For starters, a read-only import skips reading certain information 
from disk which it knows won't be needed, such as the space maps.
Furthermore, the scrub process expects to be checkpoint it's progress, 
update the on disk error log, and issue repair IO.  None of which 
would be possible when the pool is imported read-only.  

Each of these wrinkles can certainly be handled, but that will take 
some signifcant work.  In the meanwhile we disable the 'zpool scrub' 
command when the pool is imported read-only.

Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #17527
Closes #17717
2025-09-11 10:58:46 -07:00
Paul Dagnelie
d64711c202 Detect a slow raidz child during reads
A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios count` is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17227
2025-09-10 15:25:03 -07:00
Paul Dagnelie
0620c979a5 Remove RAIDZ reconstruct flags from debug defaults
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17227
2025-09-10 15:24:50 -07:00
Paul Dagnelie
bc4aac0395 Enable zhack to work properly with 4k sector size disks
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17576
2025-09-10 11:13:55 -07:00
Chunwei Chen
e3c3e86c04
Fix wrong dedup_table_size for legacy dedup
If we call ddt_log_load() for legacy ddt, we will end up going into
ddt_log_update_stats() and filling uninitialized value into ddo_dspace.
This value will then get added to dedup_table_size during
ddt_get_dedup_object_stats().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17019
Closes #17699

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
2025-09-08 14:02:51 -07:00
Rob Norris
ced72fdd69
tunables: remove legacy FreeBSD aliases
These are old pre-OpenZFS tunable names that have long been
available via either conventional ZFS_MODULE_PARAM tunables or through
kstats. There's no point doubling up anymore, so delete them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17375
2025-09-08 10:03:01 -07:00
Rob Norris
64d3143e82
zvol: reject suspend attempts when zvol is shutting down
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17690
2025-09-03 11:13:09 -07:00
youzhongyang
b6bd3228bb
Synchronize the update of feature refcount
The concurrent execution of feature_sync() can lead to a panic due 
to an unprotected update of the feature refcount.  Resolve this by
using the spa->spa_feat_stats_lock to synchronize the update of the 
refcount.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #17184
Closes #17632
2025-08-22 16:35:58 -07:00
Rob Norris
574eec2964 dnode: remove dn_dirtyctx and dnode_dirtycontext
Only used for a couple of debug assertions which had very little value.

Setting it required taking certain locks, so we can remove all that too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:38 -07:00
Rob Norris
aa6f0f878b dnode: remove dn_dirtyctx_firstset
Old debug param, not used for anything.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:36 -07:00
Rob Norris
eecff1b4a9 dnode: remove dn_dirty_txg and DNODE_IS_DIRTY
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only
existed to ensure that a dnode was clean before making it eligible for
removal from the array of cached dnodes attached to the object 0 L0
dbuf.

dn_dirtycnt is enough to check that now, so use it directly and remove
the rest.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:35 -07:00
Rob Norris
f3e49b0cf5 dnode_is_dirty: reimplement in terms of dn_dirtycnt
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:33 -07:00
Rob Norris
3abf72b251 dnode: add dn_dirtycnt, count of number of txgs this dnode is dirty on
Bumped when we take the dirty hold in dnode_setdirty(), dropped when the
dnode is finally cleaned up after sync in dnode_rele_task() or
userquota_updates_task().

This gives us a way to check if the dnode is dirty on any txg without
having to rely on outside information (eg presence on a dirty list),
which has been a rich source of bugs in the past.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Suggested-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:29 -07:00
Rob Norris
dcd73069f0 zvol_remove_minors_impl: remove all async fallbacks
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.

Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:47 -07:00
Rob Norris
8a0e5e8b54 zvol: stop using zvol_state_lock to protect OS-side private data
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).

This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.

The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep to wait for them while hold
zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets even works again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.

However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lockout to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.

While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:34 -07:00
Rob Norris
96f9d271ea zvol: remove the OS-side minor before freeing the zvol
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
be have an active hold, from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:21 -07:00
Rob Norris
b2c792778c zvol: generalise zvol_remove_minors_impl() for single zvol case
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so lets just use
the same function with a flag to indicate if we should include children
and snapshots or not.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:11 -07:00
Brian Behlendorf
5061f959d1
Retire zfs_autoimport_disable kmod option
Back in 2014 the zfs_autoimport_disable module option was added to
control whether the kmods should load the pool configs from the cache
file on module load.  The default value since that time has been for
the kernel to not process the cache file.

Detecting and importing pools during boot is now controlled outside
of the kmod on both Linux and FreeBSD.  By all accounts this has been
working well and we can remove this dormant code on the kernel side.

The spa_config_load() function is has been moved to userspace, it is
now only used by libzpool.  Additionally, the spa_boot_init() hook
which was used by FreeBSD now looks to be used and was removed.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17618
2025-08-14 14:58:58 -07:00
Alexander Motin
d151432073
ZIL: Make allocations more flexible
When ZIL allocates space for new LWBs without knowing how much it
will require, it can use new metaslab_alloc_range() function to
allocate slightly more or less than it predicted.  It allows to
improve space efficiency by allocating bigger LWBs on RAIDZ/dRAID
instead of padding and possibly packing more ZIL records there.
It may also allow to reduce ganging in some cases by allowing to
allocate smaller LWBs when we are not sure we'll need bigger.

On the opposite side, when we allocate space for already closed
LWBs, when we precisely know how much space we need, we may just
allocate what we need instead of relying on writing less than
allocated, that does not work for RAIDZ.

Space for LWBs in open state (still being filled) is allocated
same as before.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17613
2025-08-14 08:50:17 -07:00
Alexander Motin
885d929cf8
Fix missed assertion update in physical rewrite patch
Physical rewrite patch changed the meaning of BP_GET_BIRTH(), but
I missed update one of its occurences, ending up asserting equal
logical birth times instead of equal physical birth times.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Fixes #17565
Closes #17631
2025-08-13 15:56:25 -04:00
Jitendra Patidar
077269bfed
Fix Assert in dbuf_undirty, which triggers during usage zap shrink
Usage zap's (DMU_*USED_OBJECT) are updated in syncing context via
do_userquota_cacheflush(). zap shrink triggers,
ASSERT(db->db_objset == dmu_objset_pool(db->db_objset)->dp_meta_objset
    || txg != spa_syncing_txg(dmu_objset_spa(db->db_objset)));

DMU_*USED_OBJECT are special object (DMU_OBJECT_IS_SPECIAL), gets
updated in syncing context only. So, relax assert for it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #17602
2025-08-12 14:19:05 -07:00
Brian Behlendorf
1ccae433e9
Allow vmem_alloc backed multilists
Systems with a large number of CPU cores (192+) may trigger the large
allocation warning in multilist_create() on Linux.  Silence the warning
by converting the allocation to vmem_alloc().

On Linux this results in a call to kvalloc() which will alloc vmem
for large allocations and kmem for small allocations.

On FreeBSD both vmem_alloc and kmem_alloc internally use the same
allocator so there is no functional change.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17616
2025-08-12 13:36:03 -07:00
Rob Norris
531568f438 zil_suspend: fix cookie leak if ZIL crashes during wait
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:32 -07:00
Rob Norris
7c9adc6858 zil_process_commit_list: fail better if the pool suspends in stall
Make sure we properly inform the nolwb waiters of the error, and don't
keep trying.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:27 -07:00
Rob Norris
f562e0f691 ZIL: single zil_commit_waiter_done() function to complete a waiter
Just making it easier to not get the locking and broadcast wrong.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:22 -07:00
Rob Norris
92da3e18c8 ZIL: flag crashed LWBs so we know not to process them
If the ZIL crashed, any outstanding LWBs are no longer interesting, so
if they return, we need to just clean them up and return, not try to do
any work on them. This is true even if they return success, as that may
be long after the pool suspended and resumed, depending on when/if the
kernel decides to return the IO to us. In particular, we must not try to
get the "next" LWB from zl_lwb_list, since they're no longer on that
list.

So, we put a flag on in-flight LWBs in zil_crash() when we move them
from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when
they return.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:16 -07:00
Rob Norris
508c546975 ZIL: use a bitfield for LWB "slog" and "slim" state flags
I'm soon about to need another LWB flag, and boolean_t is just so big
for only storing a single bit. Changing to a bitfield is far less
wasteful.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:23:59 -07:00
Rob Norris
2fd145b578
zvol: cleanup error handling and passthrough
This is trying to get all the uses and non-uses of SET_ERROR correct
(being: only call it if we're the originator of an error _within ZFS_),
and correctly negating errors going to/from the kernel. And/or both.

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17605
2025-08-08 17:04:01 -07:00
Rob Norris
391e85f519 ZIL: add zil_commit_flags() to make honouring failmode= optional
The vast majority of calls to zil_commit() follow VFS ops, and should
honour the failmode= setting - either wait for sync, or return error.
Some calls however are part of a larger syncing op, and shouldn't ever
block if something goes wrong.

To allow this, we introduce zil_commit_flags(), with a flag
ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should
be honoured. zil_commit() is now a wrapper that always sets this flag,
but any caller wanting a different behaviour can request ZIL_COMMIT_NOW
instead to have the call return failure if the pool suspends, regardless
of the failmode= setting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:33 -07:00
Rob Norris
72602f6ad9 ZIL: "crash" the ZIL if the pool suspends during fallback
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks
on suspend. We want it to not block on suspend, instead returning an
error. On the surface, this is simple: change all calls to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error
return back to the zil_commit() caller.

Handling suspension means returning an error to all commit waiters. This
is relatively straightforward, as zil_commit_waiter_t already has
zcw_zio_error to hold the write IO error, which signals a fallback to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the
waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated
write (LWB) completes. If the pool has suspended, those IOs may not
return for some time, or maybe not at all. We still want to signal those
waiters so they can return from zil_commit(). We have a list of those
in-flight LWBs on zl_lwb_list, so we can run through those, detach them
and signal them. The LWB itself is still in-flight, but no longer has
attached waiters, so when it returns there will be nothing to do.

(As an aside, ITXs can also supply completion callbacks, which are
called when they are destroyed. These are directly connected to LWBs
though, so are passed the error code and destroyed there too).

At this point, all ZIL waiters have been ejected, so we only have to
consider the internal state. We potentially still have ITXs that have
not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL
is in an unknown state; some writes may have been written but not
returned to us. We really can't rely on any of it; the best thing to do
is abandon it entirely and start over when the pool returns to service.
But, since we may have IO out that won't return until the pool resumes,
we need something for it to return to.

The simplest solution I could find, implemented here, is to "crash" the
ZIL: accept no new ITXs, make no further updates, and let it empty out
on its normal schedule, that is, as txgs complete and zil_sync() and
zil_clean() are called. We set a "restart txg" to three txgs in the
future (syncing + TXG_CONCURRENT_STATES), at which point all the
internal state will have been cleared out, and the ZIL can resume
operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:
 - sets the restart txg
 - capture and signal all waiters
 - zero the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND)
returns because the pool suspended (ESHUTDOWN).

The rest of the commit is just threading the errors through, and related
housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:26 -07:00
Rob Norris
99a5f5d1ba ZIL: pass commit errors back to ITX callbacks
ITX callbacks are used to signal that something can be cleaned up after
a itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.

This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:20 -07:00
Rob Norris
967b15b888 ZIL: allow zil_commit() to fail with error
This changes zil_commit() to have an int return, and updates all callers
to check it. There are no corresponding internal changes yet; it will
always return 0.

Since zil_commit() is an indication that the caller _really_ wants the
associated data to be durability stored, I've annotated it with the
__warn_unused_result__ compiler attribute (via __must_check), to emit a
warning if it's ever ussd without doing something with the return code.
I hope this will mean we never misuse it in the future.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:09 -07:00
Rob Norris
82d6f7b047 Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:42 -07:00
Rob Norris
f7bdd84328 Prefer VERIFY0P(n) over VERIFY(n == NULL)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:37 -07:00
Rob Norris
611b95da18 Prefer VERIFY0(n) over VERIFY3S(n, ==, 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:32 -07:00
Rob Norris
5c7df3bcac Prefer VERIFY0(n) over VERIFY3U(n, ==, 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:25 -07:00
Rob Norris
c39e076f23 Prefer VERIFY0(n) over VERIFY(n == 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:40:59 -07:00
Rob Norris
e44e51f28d zvol_task_report_status: gate behind ZFS_DEBUG
dprintf() is a no-op in production builds, giving a compile warning. So,
refactor it a little to keep all the strings inside the function, and
then make the function a no-op when ZFS_DEBUG is not set.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:36:15 -07:00
Rob Norris
e6eb03a991 zvol_check_volblocksize: fix spa ref leak
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:36:09 -07:00
Rob Norris
3e671f2353 zvol: remove void return casts on void-returning functions
Casting unused returns to (void) is already of dubious value, but it's
entirely meaningless on functions that are defined as void return.
Remove the clutter.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:34:20 -07:00
Alek P
3e004369f7
Removed unused zio_decompress_fail_fraction variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alek Pinchuk <alek.pinchuk@connectwise.com>
Closes #17599
2025-08-06 17:10:03 -07:00
Alexander Motin
60f714e6e2 Implement physical rewrites
Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:56 -07:00
Alexander Motin
4ae8bf406b Allow physical rewrite without logical
During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but the physically it is the same old block.

But block rewrite introduces a new situation, when block is not
changed logically, but stored in a different place of the pool.
From ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  Somewhat similar thing happen during remap phase of
device removal, but in that case space blocks are still acounted
as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing to differentiate physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device reval case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that error log is now kept in terms of physical
physical birth times, rather than logical, since if a block with
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to lack of those writes,
but they will come in a following patch.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:07 -07:00
Mariusz Zaborski
894edd084e
Add TXG timestamp database
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of data
management and recovery operations.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16853
2025-08-06 10:31:21 -07:00
Rob Norris
a18c9edda6 Linux: sync: remove async/sync accounting
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-06 09:54:30 -07:00
Paul Dagnelie
31c4fa93bb Fix dynamic gang block headers on raidz and mirror devices
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17587
2025-08-06 09:50:58 -07:00
Fedor Uporov
0b6fd024a7
ZVOL: Unify zvol minors operations and improve error handling
Now zvol minors creation logic is passed thru spa_zvol_taskq, like it
is doing for remove/rename zvol minors functions. Appropriate
zvol minors creation functions are refactored:
- The zvol_create_minor()/zvol_minors_create_recursive() were removed.
- The single zvol_create_minors() is added instead.

Also, it become possible to collect zvol minors subtasks status, to
detect, if some zvol minor subtask is failed in the subtasks chain.
The appropriate message is reported to zfs_dbgmsg buffer in this case.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17575
2025-08-06 10:10:52 -04:00
khoang98
0f8a1105ee
Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes #17561
2025-08-01 16:47:41 -07:00
Igor Ostapenko
cb5e7e097d
range_tree: Provide more debug details upon unexpected add/remove
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17581
2025-07-31 10:44:42 -04:00
Alexander Motin
f70c85086b
BRT: Fix ZAP entry endianness
During original block cloning implementation a mistake was made,
making BRT ZAP entries an array of 8 1-byte entries instead of 1
entry of 8 bytes. This makes the pools non-endian-safe.

This commit introduces a new read-compatible pool feature
"com.truenas:block_cloning_endian", fixing the endianness issue
for new pools while maintaining compatibility with existing ones.

The feature is automatically activated when creating the first BRT
ZAP (ensuring we don't activate it on pools that already have BRT
entries in the old format).  When active, BRT entries are stored
as single 8-byte values.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17572
2025-07-30 09:42:47 -07:00
Tino Reichardt
10a78e2647
Faster checksum benchmark on system boot
While booting, only the needed 256KiB benchmarks are done now.

The delay for checking all checksums occurs when requested via:
- Linux: cat /proc/spl/kstat/zfs/chksum_bench
- FreeBSD: sysctl kstat.zfs.misc.chksum_bench

Reported by: Lahiru Gunathilake <gunathilakebllg@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Colin Percival <cperciva@tarsnap.com>
Closes #17563
Closes #17560
2025-07-29 17:09:48 -07:00
Paul Dagnelie
fc885f308f
Don't use wrong weight when passivating group
When we're passivating a metaslab group we start by passivating the 
metaslabs that have been activated for each of the allocators.  To do 
that, we need to provide a weight. However, currently this erroneously 
always uses a segment-based weight, even if segment-based weighting is 
disabled.

Use the normal weight function, which will decide which type of weight 
to use.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17566
2025-07-29 14:28:01 -07:00
Brian Behlendorf
cf146460c1
Default to zfs_bclone_wait_dirty=1
Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to the optimial
thing and immediately force a TXG sync.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17455
2025-07-25 10:42:23 -04:00
Rob Norris
bf38c15071 everywhere: misc unnecessary var init/update
These are all cases where we initialise or update a variable, and then
never use it. None of them particularly matter, as the compiler should
optimise them all away during dead store elimination, but some static
analysers complain about them and they are extra work for casual readers
to follow, so worth removing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:58 -07:00
Rob Norris
d2b9e66b88 vdev_raidz: asize/psize: remove unnecessary var initialisation
It would have been optimised away anyway so it doesn't matter, but it
does make things a little tougher to read.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:51 -07:00
Rob Norris
2755e2aa60 spa_activity_check: narrow scope of MMP vars
They aren't used outside these very small blocks, and their initial
values are never used at all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:07 -07:00
shodanshok
a7a144e655
enforce arc_dnode_limit
Linux kernel shrinker in the context of null/root memcg does not scan
dentry and inode caches added by a task running in non-root memcg. For
ZFS this means that dnode cache routinely overflows, evicting valuable
meta/data and putting additional memory pressure on the system.

This patch restores zfs_prune_aliases as fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes. Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17487
Closes #17542
2025-07-21 10:32:01 -07:00
Alexander Motin
be1e991a1a
Allow and prefer special vdevs as ZIL
Before this change ZIL blocks were allocated only from normal or
SLOG vdevs.  In typical situation when special vdevs are SSDs and
normal are HDDs it could cause weird inversions when data blocks
are written to SSDs, but ZIL referencing them to HDDs.

This change assumes that special vdevs typically have much better
(or at least not worse) latency than normal, and so in absence of
SLOGs should store ZIL blocks.  It means similar to normal vdevs
introduction of special embedded log allocation class and updating
the allocation fallback order to: SLOG -> special embedded log ->
special -> normal embedded log -> normal.

The code tries to guess whether data block is going to be written
to normal or special vdev (it can not be done precisely before
compression) and prefer indirect writes for blocks written to a
special vdev to avoid double-write.  For blocks that are going to
be written to normal vdev, special vdev by default plays as SLOG,
reducing write latency by the cost of higher special vdev wear,
but it is tunable via module parameter.

This should allow HDD pools with decent SSD as special vdev to
work under synchronous workloads without requiring additional
SLOG SSD, impractical in many scenarios.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17505
2025-07-18 18:44:14 -07:00
Alexander Motin
d7ab07dfb4
ZIL: Force writing of open LWB on suspend
Under parallel workloads ZIL may delay writes of open LWBs that
are not full enough.  On suspend we do not expect anything new to
appear since zil_get_commit_list() will not let it pass, only
returning TXG number to wait for.  But I suspect that waiting for
the TXG commit without having the last LWB issued may not wait for
its completion, resulting in panic described in #17509.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17521
2025-07-17 15:31:19 -07:00
Paul Dagnelie
c1e51c55f5
Correct weight recalculation of space-based metaslabs
Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes #17531
2025-07-16 10:20:57 -07:00
Paul Dagnelie
a981cb69e4 Implement dynamic gang header sizes
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:02:53 -07:00
Rob Norris
6af8db61b1
metaslab: don't pass whole zio to throttle reserve APIs
They only need a couple of fields, and passing the whole thing just
invites fiddling around inside it, like modifying flags, which then
makes it much harder to understand the zio state from inside zio.c.

We move the flag update to just after a successful throttle in zio.c.

Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED
Better describes what it means, and makes it look less like
IO_IS_ALLOCATING, which means something different.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17508
2025-07-04 23:22:22 -04:00