In cases where all issued ZIOs must succeed and we can't do
anything clever about the errors, we should just explicitly set
ZIO_FLAG_TRYHARD and let the OS do all the reasonable retries.
In other cases, where a retry can differ from the original request,
for example when some ZIOs are allowed to fail due to redundancy,
or when aggregation can be disabled on retry to get at least some
of the data, we can do a first pass without TRYHARD and only retry
with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics) if needed.
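As a rough sketch of that two-pass pattern in plain C (not ZFS code; issue_io() and the flag values are stand-ins for issuing a ZIO with the flags above):
    #include <stdbool.h>
    #include <stdio.h>
    #define FLAG_TRYHARD   (1u << 0)    /* stands in for ZIO_FLAG_TRYHARD */
    #define FLAG_IO_RETRY  (1u << 1)    /* stands in for ZIO_FLAG_IO_RETRY */
    /* Stub standing in for issuing one I/O; pretend the cheap pass fails. */
    static int issue_io(unsigned flags)
    {
        return (flags == 0 ? -1 : 0);
    }
    static int read_with_fallback(bool must_succeed)
    {
        if (must_succeed)       /* nothing clever to do on error */
            return (issue_io(FLAG_TRYHARD));
        if (issue_io(0) == 0)   /* cheap first pass; redundancy may cover */
            return (0);
        /* Second pass: explicit retry, which implies TRYHARD semantics. */
        return (issue_io(FLAG_IO_RETRY | FLAG_TRYHARD));
    }
    int main(void)
    {
        printf("%d\n", read_with_fallback(false));
        return (0);
    }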
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17877
Both the DDT log and the BRT counters are read on pool import and
then only appended to or overwritten in full blocks. We don't need
them in the DMU or ARC caches; fortunately we now have
DMU_UNCACHEDIO for this. Moreover, we don't need the BRT in the
non-evictable metadata DMU cache, since it will likely never fit
there while blocking the cache from its original users. Since
DMU_OT_IS_METADATA_CACHED() has no way to differentiate the new
metadata types, mark the BRT with a storage type of DMU_OT_DDT_ZAP.
As a side effect this will also place it on a dedup device, but
that should actually be right.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17875
Since we set the bv_mos_brtvdev block size, and since we keep the
dirty bitmap at the same granularity, we should keep allocations
and writes aligned with it as well. Otherwise the last block write
comes up short, which will be odd once we implement writing only
dirty blocks, and it also requires a read-modify-write at the DMU
layer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17875
Over time, many DMU functions have gained a flags argument to control
prefetch, caching, etc. A few functions were left without it, even
though a closer look shows that many of them do not require prefetch
due to their access pattern. This patch adds the flags argument to
dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(),
passing DMU_READ_NO_PREFETCH where applicable.
I am going to also pass DMU_UNCACHEDIO to some of them later.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17872
We must return -1 instead of ENOENT if the special zvol threading
property set function can't locate the dataset (this would typically
happen with an encrypted and unmounted zvol), so that the operation
gets properly inserted into the nvlist of operations to set. This
is because we want the property to be set once the zvol is
decrypted again.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes#17836
Make a minor update to the 'zpool remove' man page to clarify that
both raidz and draid pools do not support removal, and change
'sector' to 'ashift', which is what we actually care about.
Update the big theory comment in vdev_removal.c to accurately reflect
which types of vdevs can be removed. Furthermore, I've added some
discussion for the casual reader to briefly explain the top-level
vdev removal restrictions. This has been a common area of confusion
and it's not intuitive where they come from without understanding
the implementation details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17847
It's an hrtime_t, which is an unsigned long long. In practice this is
just a U64.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#17833
Otherwise the compiler warns about it on production FreeBSD builds.
The routine proved resilient to attempts to ifdef on debug.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#17818
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes#17793
Originally this was created for MMP, but now new cases are emerging
where the same mechanism is required. Hence the name's generalization.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes#17793
This adds a pause to the ZIO pipeline in the ready stage for
matching I/O (data, dnode, or raw bookmark).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#17787
This changes the basic search algorithm from a single search up and down
the tree to a full depth-first traversal to handle conditions where the
tree matches at a higher level but not a lower level.
Normally higher level blocks always point to matching blocks, but there
are cases where this does not happen:
1. Racing block pointer updates from dbuf_write_ready.
Before f664f1ee7f (#8946), both dbuf_write_ready and
dnode_next_offset held dn_struct_rwlock which protected against
pointer writes from concurrent syncs.
This no longer applies, so sync context can, e.g., clear or fill all
L1->L0 BPs before the L2->L1 BP and higher BPs are updated.
dnode_free_range in particular can reach this case and skip over L1
blocks that need to be dirtied. Later, sync will panic in
free_children when trying to clear a non-dirty indirect block.
This case was found with ztest.
2. txg > 0, non-hole case. This is #11196.
Freeing blocks/dnodes breaks the assumption that a match at a higher
level implies a match at a lower level when filtering txg > 0.
Whenever some but not all L0 blocks are freed, the parent L1 block is
rewritten. Its updated L2->L1 BP reflects a newer birth txg.
Later when searching by txg, if the L1 block matches since the txg is
newer, it is possible that none of the remaining L1->L0 BPs match if
none have been updated.
The same behavior is possible with dnode search at L0.
This is reachable from dsl_destroy_head for synchronous freeing.
When this happens, open context fails to free objects, leaving sync
context stuck freeing potentially many objects.
This is also reachable from traverse_pool for extreme rewind where it
is theoretically possible that datasets not dirtied after txg are
skipped if the MOS has high enough indirection to trigger this case.
In both of these cases, without backtracking, the search ends prematurely,
as an ESRCH result implies no more matches in the entire object.
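A minimal standalone sketch of case 2 and the backtracking fix (not the actual dnode_next_offset() code; the tree layout and txg values are made up): a parent's summary birth txg is new enough even though none of its children match, so the search moves on to the next parent instead of stopping.
    #include <stdint.h>
    #include <stdio.h>
    #define NCHILD 4
    struct parent {
        uint64_t summary_txg;           /* newest child birth txg */
        uint64_t child_txg[NCHILD];     /* per-child birth txgs */
    };
    /* Return index of first leaf with birth txg > min_txg, or -1. */
    static int
    find_leaf(const struct parent *tree, int nparent, uint64_t min_txg)
    {
        for (int p = 0; p < nparent; p++) {
            if (tree[p].summary_txg <= min_txg)
                continue;               /* whole subtree too old; skip */
            for (int c = 0; c < NCHILD; c++) {
                if (tree[p].child_txg[c] > min_txg)
                    return (p * NCHILD + c);
            }
            /* Parent matched but no child did: backtrack, keep going. */
        }
        return (-1);
    }
    int main(void)
    {
        /* First parent was rewritten (new summary) but its leaves are old. */
        struct parent tree[2] = {
            { .summary_txg = 50, .child_txg = { 10, 10, 10, 10 } },
            { .summary_txg = 60, .child_txg = { 10, 60, 10, 10 } },
        };
        printf("match at leaf %d\n", find_leaf(tree, 2, 40));
        return (0);
    }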
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#16025
Closes#11196
Provide an interface to retrieve the lowest and highest minimum
allocation size for the normal allocation class. This can be used
by external consumers of the DMU to estimate potential wasted
capacity when setting the recordsize for an object.
The new "min_alloc" and "max_alloc" keys are added to the pool
configuration and used by default_volblocksize() to warn when
an inefficient block size is requested. For older kmods which
don't yet include the new keys, fall back to the previous logic.
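A hedged sketch of how a consumer might read the new keys with a fallback, using the libnvpair lookup API; the key strings follow the commit text, while the fallback constants are placeholders rather than the actual previous logic.
    #include <stdio.h>
    #include <libnvpair.h>
    /* Placeholder defaults for older kmods that lack the new keys. */
    #define FALLBACK_MIN_ALLOC      512ULL
    #define FALLBACK_MAX_ALLOC      4096ULL
    static void
    get_alloc_sizes(nvlist_t *config, uint64_t *minp, uint64_t *maxp)
    {
        if (nvlist_lookup_uint64(config, "min_alloc", minp) != 0)
            *minp = FALLBACK_MIN_ALLOC;     /* older kmod: previous logic */
        if (nvlist_lookup_uint64(config, "max_alloc", maxp) != 0)
            *maxp = FALLBACK_MAX_ALLOC;
    }
    int main(void)
    {
        nvlist_t *config;
        uint64_t min_alloc, max_alloc;
        if (nvlist_alloc(&config, NV_UNIQUE_NAME, 0) != 0)
            return (1);
        /* Pretend this config came from a new kmod. */
        (void) nvlist_add_uint64(config, "min_alloc", 4096);
        (void) nvlist_add_uint64(config, "max_alloc", 16384);
        get_alloc_sizes(config, &min_alloc, &max_alloc);
        printf("min_alloc=%llu max_alloc=%llu\n",
            (unsigned long long)min_alloc, (unsigned long long)max_alloc);
        nvlist_free(config);
        return (0);
    }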
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17758
The time database update math assumed that the timestamps were in
nanoseconds, but at some point in the development or review process they
changed to seconds. This PR fixes the math to use seconds instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17735
For ABS() to work, the argument must be signed, but rrdd_time is
uint64_t. Clang noticed it.
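A tiny standalone demo of the bug class (not the rrdd code): the unsigned comparison inside ABS() is always false, which is what Clang flags, and the subtraction has already wrapped anyway.
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>
    #define ABS(x) ((x) < 0 ? -(x) : (x))   /* always-false test for unsigned x */
    int main(void)
    {
        uint64_t a = 100, b = 250;
        /* Wrong: a - b wraps around, and ABS() can't fix it up. */
        uint64_t broken = ABS(a - b);
        /* Right: compare first, or cast to a signed type of equal width. */
        uint64_t dist = (a > b) ? a - b : b - a;
        int64_t sdist = (int64_t)(a - b);       /* OK if |a-b| < 2^63 */
        printf("broken=%" PRIu64 " dist=%" PRIu64 " signed=%" PRId64 "\n",
            broken, dist, sdist < 0 ? -sdist : sdist);
        return (0);
    }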
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Fixes#16853
Closes#17733
In ddt_log_load(), when removing a duplicate entry from the flushing
tree, the entry is not freed, causing a memory leak.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes#17657
Closes#17730
A new `zfs allow` permission that ONLY allows sending replication
streams in raw (encrypted) mode, so encrypted data will not be
decrypted as part of the replication process.
Sponsored-by: Klara, Inc.
Sponsored-by: Karakun AG
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Co-authored-by: JT Pennington <jt.pennington@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#17543
While it would be nice to be able to scrub a pool imported read-only,
this will currently trip an ASSERT. Before we can support this there
are some design challenges which need to be thought through first.
For starters, a read-only import skips reading certain information
from disk which it knows won't be needed, such as the space maps.
Furthermore, the scrub process expects to checkpoint its progress,
update the on-disk error log, and issue repair IO, none of which
would be possible when the pool is imported read-only.
Each of these wrinkles can certainly be handled, but that will take
some significant work. In the meantime we disable the 'zpool scrub'
command when the pool is imported read-only.
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #17527
Closes#17717
A single slow responding disk can affect the overall read
performance of a raidz group. When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.
Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent of
class `ereport.fs.zfs.delay` is posted.
The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter. Setting it to
zero disables slow outlier detection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17227
If we call ddt_log_load() for legacy ddt, we will end up going into
ddt_log_update_stats() and filling an uninitialized value into ddo_dspace.
This value will then get added to dedup_table_size during
ddt_get_dedup_object_stats().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes#17019
Closes#17699
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17690
The concurrent execution of feature_sync() can lead to a panic due
to an unprotected update of the feature refcount. Resolve this by
using the spa->spa_feat_stats_lock to synchronize the update of the
refcount.
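A generic illustration of the fix, using pthreads rather than the kernel primitives; the variable names only echo the commit text and are not the feature_sync() code.
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t feat_refcount;
    static void *
    bump(void *arg)
    {
        (void) arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&stats_lock);
            feat_refcount++;        /* unprotected, this update would race */
            pthread_mutex_unlock(&stats_lock);
        }
        return (NULL);
    }
    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%llu\n", (unsigned long long)feat_refcount);   /* 200000 */
        return (0);
    }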
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#17184
Closes#17632
Only used for a couple of debug assertions which had very little value.
Setting it required taking certain locks, so we can remove all that too.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16297
Closes#17652
Closes#17658
Old debug param, not used for anything.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16297
Closes#17652
Closes#17658
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only
existed to ensure that a dnode was clean before making it eligible for
removal from the array of cached dnodes attached to the object 0 L0
dbuf.
dn_dirtycnt is enough to check that now, so use it directly and remove
the rest.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16297
Closes#17652
Closes#17658
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16297
Closes#17652
Closes#17658
Bumped when we take the dirty hold in dnode_setdirty(), dropped when the
dnode is finally cleaned up after sync in dnode_rele_task() or
userquota_updates_task().
This gives us a way to check if the dnode is dirty on any txg without
having to rely on outside information (eg presence on a dirty list),
which has been a rich source of bugs in the past.
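The counter pattern in standalone form (names are stand-ins, not the dnode code): bump on dirty, drop when cleaned after sync, and answer "dirty on any txg?" from the counter alone.
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    struct node {
        uint32_t dirtycnt;              /* stands in for dn_dirtycnt */
    };
    static void
    node_setdirty(struct node *n)       /* dirty hold taken */
    {
        n->dirtycnt++;
    }
    static void
    node_cleaned(struct node *n)        /* cleaned up after sync */
    {
        assert(n->dirtycnt > 0);
        n->dirtycnt--;
    }
    static bool
    node_is_dirty(const struct node *n)
    {
        return (n->dirtycnt != 0);
    }
    int main(void)
    {
        struct node n = { 0 };
        node_setdirty(&n);              /* dirtied in txg N */
        node_setdirty(&n);              /* dirtied again in txg N+1 */
        node_cleaned(&n);               /* txg N synced */
        assert(node_is_dirty(&n));      /* still dirty until N+1 syncs */
        node_cleaned(&n);
        assert(!node_is_dirty(&n));     /* now eligible for eviction */
        return (0);
    }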
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Suggested-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16297
Closes#17652
Closes#17658
Since both ZFS- and OS-sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the very
complicated removal code to fall back to async tasks if the locks needed
at each stage can't be obtained right now.
Here we change it to be a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed activity
and then remove them, and finally, free them.
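The shape of that linear process in a generic sketch; the structure and the wait helper here are placeholders, not the zvol code.
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>
    struct vol { bool flagged; int users; };
    /* Stub: in the real code this waits for the volume to shed activity. */
    static void wait_for_no_users(struct vol *v) { while (v->users > 0) { } }
    static void
    remove_vols(struct vol **vols, size_t n)
    {
        for (size_t i = 0; i < n; i++)      /* 1. select and flag */
            vols[i]->flagged = true;
        for (size_t i = 0; i < n; i++)      /* 2. wait, then remove */
            wait_for_no_users(vols[i]);
        for (size_t i = 0; i < n; i++)      /* 3. free */
            free(vols[i]);
    }
    int main(void)
    {
        struct vol *v[2] = { calloc(1, sizeof (struct vol)),
                             calloc(1, sizeof (struct vol)) };
        remove_vols(v, 2);
        return (0);
    }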
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
zvol_state_lock is intended to protect access to the global name->zvol
lists (zvol_find_by_name()), but has also been used to control access to
OS-side private data, accessed through whatever kernel object is used to
represent the volume (gendisk, geom, etc).
This appears to have been necessary to some degree because the OS-side
object is what's used to get a handle on zvol_state_t, so zv_state_lock
and zv_suspend_lock can't be used to manage access, but also, with the
private object and the zvol_state_t being shutdown and destroyed at the
same time in zvol_os_free(), we must ensure that the private object
pointer only ever corresponds to a real zvol_state_t, not one in partial
destruction. Taking the global lock seems like a convenient way to
ensure this.
The problem with this is that zvol_state_lock does not actually protect
access to the zvol_state_t internals, so we need to take zv_state_lock
and/or zv_suspend_lock. If those are contended, this can then cause
OS-side operations (eg zvol_open()) to sleep waiting for them while
holding zvol_state_lock. This then blocks out all other OS-side operations which
want to get the private data, and any ZFS-side control operations that
would take the write half of the lock. It's even worse if ZFS-side
operations induce OS-side calls back into the zvol (eg creating a zvol
triggers a partition probe inside the kernel, and also a userspace
access from udev to set up device links). And it gets even worse again
if anything decides to defer those ops to a task and wait on them, which
zvol_remove_minors_impl() will do under high load.
However, since the previous commit, we have a guarantee that the private
data pointer will always be NULL'd out in zvol_os_remove_minor()
_before_ the zvol_state_t is made invalid, but it won't happen until all
users are ejected. So, if we make access to the private object pointer
atomic, we remove the need to take a global lockout to access it, and so
we can remove all acquisitions of zvol_state_lock from the OS side.
While here, I've rewritten much of the locking theory comment at the top
of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've
tried to describe the purpose of each lock in a little more detail, and
in particular describe where it should and shouldn't be used.
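A simplified sketch of the idea using C11 atomics rather than the kernel primitives the real code uses; it omits the wait-for-users step described above, and the names are stand-ins.
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    struct zv_state { int dummy; };
    /* Stands in for the OS-side object's private-data field. */
    static _Atomic(struct zv_state *) private_data;
    static int
    os_open(void)
    {
        struct zv_state *zv = atomic_load(&private_data);
        if (zv == NULL)
            return (-1);    /* minor already being torn down */
        /* ... take zv's own locks and proceed; no global lock needed ... */
        return (0);
    }
    static void
    os_remove_minor(struct zv_state *zv)
    {
        /* Decouple first (and, in the real code, wait out existing users),
         * so the pointer never refers to a half-destroyed state. */
        atomic_store(&private_data, NULL);
        free(zv);
    }
    int main(void)
    {
        struct zv_state *zv = calloc(1, sizeof (*zv));
        atomic_store(&private_data, zv);
        printf("open before removal: %d\n", os_open());
        os_remove_minor(zv);
        printf("open after removal: %d\n", os_open());
        return (0);
    }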
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
When destroying a zvol, it is not "unpublished" from the system (that
is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the
time del_gendisk() and put_disk() are called, the device node may still
have an active hold from a userspace program or something inside the
kernel (a partition probe). As it is currently, this can lead to calls
to zvol_open() or zvol_release() while the zvol_state_t is partially or
fully freed. zvol_open() has some protection against this by checking
that private_data is NULL, but zvol_release does not.
This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.
This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.
Equivalent changes are made on the FreeBSD side to follow the API
change.
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
zvol_remove_minor_impl() and zvol_remove_minors_impl() should be
identical except for how they select zvols to remove, so let's just use
the same function with a flag to indicate if we should include children
and snapshots or not.
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17625
Back in 2014 the zfs_autoimport_disable module option was added to
control whether the kmods should load the pool configs from the cache
file on module load. The default value since that time has been for
the kernel to not process the cache file.
Detecting and importing pools during boot is now controlled outside
of the kmod on both Linux and FreeBSD. By all accounts this has been
working well and we can remove this dormant code on the kernel side.
The spa_config_load() function has been moved to userspace; it is
now only used by libzpool. Additionally, the spa_boot_init() hook
which was used by FreeBSD now looks to be unused and was removed.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17618
When the ZIL allocates space for new LWBs without knowing how much
it will require, it can use the new metaslab_alloc_range() function
to allocate slightly more or less than it predicted. This allows
improving space efficiency by allocating bigger LWBs on RAIDZ/dRAID
instead of padding, and possibly packing more ZIL records into them.
It may also reduce ganging in some cases by allowing smaller LWBs
to be allocated when we are not sure we'll need bigger ones.
On the opposite side, when we allocate space for already closed
LWBs, where we know precisely how much space we need, we may just
allocate that amount instead of relying on writing less than
allocated, which does not work for RAIDZ.
Space for LWBs in the open state (still being filled) is allocated
the same way as before.
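A toy model of the range-allocation idea (not the metaslab API; the helper and sizes are made up): ask for anything between a minimum and maximum, and get back what actually fit.
    #include <stdint.h>
    #include <stdio.h>
    /* Pretend the allocator can hand out at most 96k contiguously. */
    static int
    alloc_range(uint64_t min_size, uint64_t max_size, uint64_t *actual)
    {
        uint64_t avail = 96 * 1024;
        if (avail < min_size)
            return (0);                         /* allocation failed */
        *actual = (avail < max_size) ? avail : max_size;
        return (1);
    }
    int main(void)
    {
        uint64_t got;
        /* Open LWB: size not yet known, accept anything in [64k, 128k]. */
        if (alloc_range(64 * 1024, 128 * 1024, &got))
            printf("open lwb: got %llu bytes\n", (unsigned long long)got);
        /* Closed LWB: size known exactly, so min == max. */
        if (alloc_range(72 * 1024, 72 * 1024, &got))
            printf("closed lwb: got %llu bytes\n", (unsigned long long)got);
        return (0);
    }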
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17613
The physical rewrite patch changed the meaning of BP_GET_BIRTH(), but
I missed updating one of its occurrences, ending up asserting equal
logical birth times instead of equal physical birth times.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Fixes#17565
Closes#17631
Usage ZAPs (DMU_*USED_OBJECT) are updated in syncing context via
do_userquota_cacheflush(). ZAP shrink then trips:
ASSERT(db->db_objset == dmu_objset_pool(db->db_objset)->dp_meta_objset
|| txg != spa_syncing_txg(dmu_objset_spa(db->db_objset)));
DMU_*USED_OBJECT are special objects (DMU_OBJECT_IS_SPECIAL) and get
updated in syncing context only, so relax the assert for them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#17602
Systems with a large number of CPU cores (192+) may trigger the large
allocation warning in multilist_create() on Linux. Silence the warning
by converting the allocation to vmem_alloc().
On Linux this results in a call to kvmalloc(), which will allocate vmem
for large allocations and kmem for small allocations.
On FreeBSD both vmem_alloc and kmem_alloc internally use the same
allocator so there is no functional change.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17616
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17622
Make sure we properly inform the nolwb waiters of the error, and don't
keep trying.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17622
Just making it easier to not get the locking and broadcast wrong.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17622
If the ZIL crashed, any outstanding LWBs are no longer interesting, so
if they return, we need to just clean them up and return, not try to do
any work on them. This is true even if they return success, as that may
be long after the pool suspended and resumed, depending on when/if the
kernel decides to return the IO to us. In particular, we must not try to
get the "next" LWB from zl_lwb_list, since they're no longer on that
list.
So, we put a flag on in-flight LWBs in zil_crash() when we move them
from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when
they return.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17622
I'm soon about to need another LWB flag, and boolean_t is just so big
for only storing a single bit. Changing to a bitfield is far less
wasteful.
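The space argument in standalone form; the lwb field and flag names here are made up, not the real ones.
    #include <stdint.h>
    #include <stdio.h>
    typedef enum { B_FALSE = 0, B_TRUE = 1 } boolean_t;    /* int-sized */
    struct lwb_bools {          /* one int per single-bit fact */
        boolean_t issued;
        boolean_t slog;
        boolean_t indirect;
        boolean_t crashed;      /* the new flag that prompted this */
    };
    #define LWB_FLAG_ISSUED     (1u << 0)
    #define LWB_FLAG_SLOG       (1u << 1)
    #define LWB_FLAG_INDIRECT   (1u << 2)
    #define LWB_FLAG_CRASHED    (1u << 3)
    struct lwb_flags {          /* all of the above in one word */
        uint32_t flags;
    };
    int main(void)
    {
        printf("%zu vs %zu bytes\n",
            sizeof (struct lwb_bools), sizeof (struct lwb_flags));
        struct lwb_flags lwb = { .flags = LWB_FLAG_SLOG };
        lwb.flags |= LWB_FLAG_CRASHED;              /* set */
        if (lwb.flags & LWB_FLAG_CRASHED)           /* test */
            lwb.flags &= ~LWB_FLAG_CRASHED;         /* clear */
        return (0);
    }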
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17622
This is trying to get all the uses and non-uses of SET_ERROR correct
(being: only call it if we're the originator of an error _within ZFS_),
and correctly negating errors going to/from the kernel. And/or both.
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17605
The vast majority of calls to zil_commit() follow VFS ops, and should
honour the failmode= setting - either wait for sync, or return error.
Some calls however are part of a larger syncing op, and shouldn't ever
block if something goes wrong.
To allow this, we introduce zil_commit_flags(), with a flag
ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should
be honoured. zil_commit() is now a wrapper that always sets this flag,
but any caller wanting a different behaviour can request ZIL_COMMIT_NOW
instead to have the call return failure if the pool suspends, regardless
of the failmode= setting.
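The wrapper pattern in a generic sketch; the names echo the commit text but the signatures and flag values are illustrative, not the zil.h prototypes.
    #include <stdint.h>
    #include <stdio.h>
    #define COMMIT_FAILMODE (1u << 0)   /* honour failmode= on suspend */
    #define COMMIT_NOW      (0u)        /* return failure if pool suspends */
    /* Stub for the flag-taking variant; pretend the pool suspended. */
    static int
    commit_flags(uint64_t foid, unsigned flags)
    {
        (void) foid;
        int suspended = 1;
        if (suspended && !(flags & COMMIT_FAILMODE))
            return (-1);        /* caller handles the failure itself */
        return (0);             /* wait (or error) per failmode= */
    }
    /* Existing entry point: unchanged behaviour for the common VFS path. */
    static int
    commit(uint64_t foid)
    {
        return (commit_flags(foid, COMMIT_FAILMODE));
    }
    int main(void)
    {
        printf("vfs path: %d, syncing op: %d\n",
            commit(1), commit_flags(1, COMMIT_NOW));
        return (0);
    }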
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks
on suspend. We want it to not block on suspend, instead returning an
error. On the surface, this is simple: change all calls to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error
return back to the zil_commit() caller.
Handling suspension means returning an error to all commit waiters. This
is relatively straightforward, as zil_commit_waiter_t already has
zcw_zio_error to hold the write IO error, which signals a fallback to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the
waiter can now return an error from zil_commit().
However, commit waiters are normally signalled when their associated
write (LWB) completes. If the pool has suspended, those IOs may not
return for some time, or maybe not at all. We still want to signal those
waiters so they can return from zil_commit(). We have a list of those
in-flight LWBs on zl_lwb_list, so we can run through those, detach them
and signal them. The LWB itself is still in-flight, but no longer has
attached waiters, so when it returns there will be nothing to do.
(As an aside, ITXs can also supply completion callbacks, which are
called when they are destroyed. These are directly connected to LWBs
though, so are passed the error code and destroyed there too).
At this point, all ZIL waiters have been ejected, so we only have to
consider the internal state. We potentially still have ITXs that have
not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL
is in an unknown state; some writes may have been written but not
returned to us. We really can't rely on any of it; the best thing to do
is abandon it entirely and start over when the pool returns to service.
But, since we may have IO out that won't return until the pool resumes,
we need something for it to return to.
The simplest solution I could find, implemented here, is to "crash" the
ZIL: accept no new ITXs, make no further updates, and let it empty out
on its normal schedule, that is, as txgs complete and zil_sync() and
zil_clean() are called. We set a "restart txg" to three txgs in the
future (syncing + TXG_CONCURRENT_STATES), at which point all the
internal state will have been cleared out, and the ZIL can resume
operation (handled at the top of zil_clean()).
This commit adds zil_crash(), which handles all of the above:
- sets the restart txg
- captures and signals all waiters
- zeroes the header
zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND)
returns because the pool suspended (ESHUTDOWN).
The rest of the commit is just threading the errors through, and related
housekeeping.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398
ITX callbacks are used to signal that something can be cleaned up after
an itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.
This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.
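A generic sketch of the extended callback shape (not the actual itx API): the commit path now hands the callback an error so it can take a different cleanup action when the write didn't make it out.
    #include <stdio.h>
    /* Old shape: void (*cb)(void *arg); new shape adds an error code. */
    typedef void (*commit_cb_t)(void *arg, int err);
    struct page { int dirty; };
    static void
    page_commit_cb(void *arg, int err)
    {
        struct page *pg = arg;
        if (err == 0)
            pg->dirty = 0;      /* write is on stable storage: mark clean */
        else
            printf("leaving page dirty, commit failed: %d\n", err);
    }
    static void
    run_callback(commit_cb_t cb, void *arg, int commit_err)
    {
        cb(arg, commit_err);
    }
    int main(void)
    {
        struct page pg = { .dirty = 1 };
        run_callback(page_commit_cb, &pg, 0);
        printf("dirty=%d\n", pg.dirty);
        return (0);
    }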
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17398