As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts. This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect. This extra
safety chceck operates as follows:
1. Repeats the following MMP check 10 times:
a. Write out an MMP uberblock with the best txg and a random
sequence id to all primary pool vdevs.
b. Verify a minimum number of good writes such that even if
the pool appears degraded on the remote host it will see
at least one of the updated MMP uberblocks.
c. Wait for the MMP interval this leaves a window for other
racing hosts to make similar modifications which can be
detected.
d. Call vdev_uberblock_load() to determine the best uberblock
to use, this should be the MMP uberblock just written.
e. Verify the txg and random sequeunce number match the MMP
uberblock written in 1a.
2. Restore the original MMP uberblocks. This allows the check
to be performed again if the pool fails to import for an
unrelated reason.
This change also includes some refactoring and minor improvements.
- Never try loading earlier txgs during import when the import
fails with EREMOTEIO or EINTER. These errors don't indicate
the txg is damaged but instead that its either in use on a
remote host or the import was interactively cancelled. No
rewind is also performed for EBADD which can result from a
stale trusted config when doing a verbatim import.
- Refactor the code for consistent logging of the multihost
activity check using spa_load_note() and console messages
indicating when the activity check was trigger and the result.
- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
modification of the sequence number in an uberblock.
- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
set to log to dump to stdout the spa_load_info nvlist returned
during import. This is used by the updated mmp test cases
to determine if an activity check was run and its result.
- Standardize the mmp messages similarly to make it easier to
find all the relevent mmp lines in the debug log.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Tryimport adds a unique prefix to the pool name to avoid name
collisions. This makes it awkward to log user-friendly info
during a tryimport. Add a spa_load_name() function which can
be used to report the unmodified pool name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Move the "Starting import" log message in to the import block so
it's matched with the "Fiinshed importing" debug message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
For a cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check. Only allow the check to be skipped when the
last imported hostid matches the systems hostid and the pool was
cleanly exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18143Closes#18144
To avoid read errors with transaction open dmu_tx_check_ioerr()
is used to read everything required in advance. But there seems
to be a chance for the buffer to evicted from dbuf cache in
between, which result in immediate eviction from ARC, which may
require additional disk read later in a place where error handling
is problematic.
To partially workaround this introduce a new flag DMU_IS_PREFETCH,
relayed to ARC as ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH,
making ARC delay eviction by at least several seconds, or till the
actual read inside the transaction, that will promote it to demand
access.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18160
These fields became unused when ABD was introduced in a6255b7fc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write
speeds and protect SSD endurance. Write rate is constrained by the
minimum of l2arc_write_max and DWPD-calculated budget. Devices
accumulate unused write budget over 24-hour periods with automatic reset
and carry-over. Writes occur in controlled bursts (max 50MB) with
adaptive intervals to achieve target rates. Applies after initial device
fill.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Transform L2ARC from single global feed thread to per-device threads,
enabling parallel writes to multiple L2ARC devices. Each device runs
its own feed thread independently, improving multi-device throughput.
Previously, a single thread served all devices sequentially; now each
device writes concurrently. Threads are created during device addition
and torn down on removal.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
When arc_release() is called on a header with a single buffer and
L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar
to the arc_hdr_destroy() case). If we destroy the L2HDR here, later
arc_write() will allocate a new ABD and call arc_hdr_free_abd(),
which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing
VERIFY(HDR_HAS_L2HDR(hdr)) to fail.
Allocate a new header for the buffer in the single_buf_l2writing
case (single buffer + L2_WRITING), leaving the original header with
L2HDR intact. The original header becomes an "orphan" (no buffers, no
b_pabd) but retains device association for ABD cleanup when
l2arc_write_done() completes.
The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC
makes its own transformed copy via l2arc_apply_transforms(), so the
original ABD is not used by the L2 write. The header can be safely
reused without allocating a new one.
For proper evictable space accounting, arc_buf_remove() must be
called before remove_reference() in the single_buf_l2writing path.
This ensures arc_evictable_space_increment() (during remove_reference)
and arc_evictable_space_decrement() (during destruction) see the
same state (b_buf=NULL), preventing accounting leaks that cause
module unload to hang with non-zero esize.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
With multiple L2ARC devices, headers can be destroyed asynchronously
(e.g., during zpool sync) while L2_WRITING is set. The original code
destroyed L2HDR before L1HDR, causing ABDs to lose their device
association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called.
This caused ABDs to be added to the global free-on-write list without
device information. When any L2ARC device completed its write and
attempted to free these orphaned ABDs, it would panic on
ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was
still part of another device's vdev_queue I/O aggregation gang.
Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and
reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This
ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag
ABDs with their device for deferred cleanup.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
This commit introduces per-sublist persistent markers that eliminate
redundant tail scanning between L2ARC iterations, providing significant
CPU efficiency improvements. Markers are pre-allocated during device
initialization and properly cleaned up during device removal.
The implementation uses conditional behavior based on device capacity:
small devices (capacity < arc_c) retain original HEAD/TAIL scanning
based on ARC warmup state, while large devices (capacity >= arc_c)
use the persistent marker approach for optimal CPU efficiency.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
The introduction of ARC multilists made L2ARC writing quite random,
depending on whether it found something to write in a randomly selected
sublist. This created inconsistent write patterns and poor utilization
of available sublists leading to uneven cache population.
This commit replaces random selection with systematic scanning across
all sublists within each burst. Fair headroom distribution ensures
even-depth traversal across all sublists until the target write size
is reached. Round-robin processing with random starting points eliminates
sequential bias while maintaining predictable write behavior.
The systematic approach provides consistent L2ARC filling patterns
and better utilization of available ARC data across all sublists.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
The final txgs are used only to clear out any remaining deferred
frees, and we cannot write new data to them. Make sure we do not
try to do so.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes#18139
* Lock db_mtx around arc_release() in dbuf_release_bp()
While this function is called only in sync context, the same buffer
can be touched by dbuf_hold_impl() in open context, creating races.
All other accesses to arc_release() are already protected by db_mtx,
so just take it here too.
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
* Lock db_mtx in sa_byteswap()
While SA code seems protected by sa_lock, there is a back door of
dmu_objset_userquota_get_ids(), that may hold and access the dbuf
without sa_lock, relying only on db_mtx. Taking db_mtx here should
protect both the arc_release() and the data for db_buf.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18146
`zpool create` is supposed to log the command to the new pool’s history,
as a special record that never gets evicted from the ring buffer. but
when you create a pool with `zpool create -t`, no such record is ever
logged (#18102). that bug may be the cause of issues like #16408.
`zpool create -t` (83e9986f6e) and `zpool
import -t` (26b42f3f9d) are both designed
to override the on-disk zpool property `name` with an in-core
“temporary” name, but they work somewhat differently under the hood.
importing with a temporary name sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells
spa_write_cachefile() and spa_config_generate() to use the
ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`.
creating with a temporary name permanently(!) sets the internal zpool
property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of
ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create()
(4ceb8dd6fd) and spa_create() to use that
name instead of `zc->zc_name`, then sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` like an import.
but zfsdev_ioctl_common() fails to check for `tname` when saving the
pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY,
we call spa_open() on the wrong pool name and get ENOENT, so the logging
silently fails.
this patch fixes#18102 by checking for `tname` in zfsdev_ioctl_common()
like we do in zfs_ioc_pool_create().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes#18118Closes#18102
Similar to BRT, DDT ZAP can be destroyed by sync context when it
becomes empty. Respectively similar to BRT introduce RW-lock to
protect open context methods from the destruction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18115
ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS
to indicate the presence of large blocks in the dataset. On the sending
side, this flag is included if the `-L` flag is passed to `zfs send`
and the feature is active in the dataset. On the receive side, the
stream is refused if the feature is active in the destination dataset
but the stream does not include the feature flag.
The problem is the feature is only activated when a large block is
born. If a large block has been born in the destination, but never
the source, the send can't work. This can arise when sending streams
back and forth between two datasets.
This commit fixes the problem by always activating the large blocks
feature when receiving a stream with the large block feature flag.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18105
Add a read-only dataset property, snapshots_changed_nsecs, which
exposes the nanosecond resolution version of snapshots_changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Wolfgang Hoschek <wolfgang.hoschek@mac.com>
Closes#17998Closes#18031
Now that it's built into the main zfs module in all cases, there's no
reason to put it in its own dir.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#18071
Scan time limits do not need precision beyond 1ms. Switching
scn_sync_start_time and spa_sync_starttime from gethrtime() to
getlrtime() saves ~3% of CPU time during resilver scan stage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18061
With higher throughput and lower latency of modern devices ZFS can
happily live with pretty short (fractions of a second) TXGs. But
the two decade old multi-second minimal time limits can almost stop
payload writes by extending TXGs beyond dirty data limits of ARC
ability to amortize it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18060
This improves synthetic 1 byte write speed by ~2.5%.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18063
There were some per I/O logging into dbgmsg in RAIDZ code, that
increased CPU load and wiped useful content out of dbgmsg, for
example during routine disk replacement process. I don't think
we need it to be that verbose.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18059
The first byte of the entry after compression is used for algorithm
and byte order flag. We should decrement when calling compression/
decompression algorithm.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18055
Unlike other ZAP consumers due to compression DDT does not know
how big entry it is reading from ZAP. Due to this it called
zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(),
each of which does full ZAP entry lookup.
Introduction of the combined ZAP method dramatically reduces the
CPU overhead and locks contention at DBUF layer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18048
As was previously done for BRT, avoid holding/releasing DDT ZAP
dnodes for every access. Instead hold the dnodes during all their
life time, never releasing.
While at this, add _by_dnode() interfaces for zap_length_uint64()
and zap_count(), actively used by DDT code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18047
Postponing entry removal from the DDT log in case of hit till later
single-threaded sync stage allows to make ddl_tree stable during
multi-threaded ZIO processing stage. It allows to drop the DDT lock
before the search instead of after, reducing the contention a lot.
Actually ddt_log_update_entry() was already handling the case of
entry present in the active log, so we only need to remove it from
flushing log, if the entry happen to be there.
My tests with parallel 4KB block writes show throughput increase
from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even
though still limited by the global DDT lock contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18044
Previous code effectively enforced that all async free ZIOs were
_issued_ within the TXG timeout. But they could take forever to
complete, especially if the required metadata were not in ARC.
This patch introduces periodic waits every 2000 ZIOs, which should
give at least somewhat reasonable TXG timings even for single HDD
pools with empty ARC. And makes them complete within half of the
TXG timeout, since we might still need time to sync DDT and BRT.
While there, change zfs_max_async_dedup_frees semantics to include
also clone and gang blocks, which are similar. Bump the default
value from set long ago to be more forgiving to block cloning
(still not having logs and benefiting from large TXGs), now that
we have better working time limits. The limit now is a possible
amount of dirty data produced by BRT updates.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18043
We've observed a number of cases when pool import stuck for many
minutes due to large async destroy trying to load DDT or BRT from
HDD pool. While proper destroy dosage is a separate problem,
lets give import process a chance to complete before that at all.
It may be not enough if there is a lot of ZIL to replay, but that
is harder to cover, since those are in separate syscalls.
Code investigation shown that we already have this mechanism used
for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS
into a tunable and applies it to async destroys also.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18033
Before parallel eviction implementation zfs_arc_evict_batch_limit
caused loop exits after evicting 10 headers. The cost of it is not
big and well motivated. Now though taskq task exit after the same
10 headers is much more expensive. To cover the context switch
overhead of taskq introduce another level of batching, controlled
by zfs_arc_evict_batches_limit tunable, used only for parallel
eviction.
My tests including 36 parallel reads with 4KB recordsize that shown
1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention,
now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17970
In zio_ddt_free, if a pruned dde is still in ddt, it would do nothing
and cause space leak.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#17982Closes#17983
There is no need to do MSEC_TO_TICK() for each evicted ARC header.
We can do it when tunables are set, since we already have separate
internal variables for those.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17965
For each block written or freed ZFS dirties ds_dbuf of the dataset.
While dbuf_dirty() has a fast path for already dirty dbufs, it still
require taking the lock and doing some things visible in profiler.
Investigation shown ds_dbuf dirtying by dsl_dataset_block_born()
and some of dsl_dataset_block_kill() are just not needed, since
by the time they are called in sync context the ds_dbuf is already
dirtied by dsl_dataset_sync().
Tests show this reducing large file deletion time by ~3% by saving
CPU time of single-threaded part of the sync thread.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18028
ZFS typically preserves proper LIFO ordering regarding map/unmap
operations that wrap the Linux kernel's kmap interfaces that
require such ordering, but one instance in abd_raidz_gen_iterate()
did not.
Similar issues have been fixed in the Linux kernel in the past,
see for instance CVE-2025-39899 for userfaultfd.
Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes#15668Closes#18030
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.
Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.
The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17941
Before this change DDT lock was taken 4 times per written block,
and as effectively a pool-wide lock it can be highly congested.
This change introduces a new per-entry dde_io_lock, protecting some
fields during I/O ready and done stages, so that we don't need the
global lock there.
According to my write tests on 64-thread system with 4KB blocks this
significantly reduce the global lock contention, reducing CPU usage
from 100% to expected ~80%, and increasing write throughput by 10%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17960
ddt_lookup() is a very busy code under a highly congested global
lock. Anything we can save here is very important.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17980
Even though unlike gang children it is not so critical for dedup
children to inherit parent's allocator, there is still no reason
for them to have allocation policy different from normal writes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17961
Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that
allows users to disable notifications for slow devices.
This prevents ZED and/or ZFSD from degrading the pool due to slow
I/O.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes 17477
Implement BRT (Block Reference Table) prefetch functionality similar
to existing DDT prefetch. This allows preloading BRT metadata into
ARC to improve performance for block cloning operations and frees
of earlier cloned blocks.
Make -t parameter optional. When omitted, prefetch all supported
metadata types (both DDT and BRT now).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17890
According to my observations, BRT ZAPs are typically compressible
3:1 for data and 2:1 for indirects. With ashift=12, typical these
days, it means increasing the block sizes to 8KB we may get most
of possible compression, reducing on-disk and in-ARC BRT footprint
in half by the cost of some compression/decompression overhead,
but without real write inflation, only some dirty data increase.
Increase to 32KB similar to DDT could further increase compression
and storage efficiency, but at the cost of write inflation and
much bigger dirty data increase, which we can not properly control
now. So lets leave this for a time when BRT log gets implemented.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17916
dmu_object_info_from_dnode() takes two locks and copies plenty of
data that we don't need in zap_lockdir_impl(). Just read dn_type
directly in this hot path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17921
This is useful as debugging support, as it lets namespace lock
operations be traced directly. It will also be useful for future work to
reduce the use of spa_namespace_lock, traditionally a source of
difficult deadlocks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17906
Free issue threads might block waiting for synchronous DDT, BRT or
GANG header reads. So unlike other taskqs using ZTI_SCALE to scale
with number of CPUs, here we also need some amount of threads to
potentially saturate pool reads. I am not sure we always want the
96 threads we had before ZTI_SCALE introduction at #11966 on small
systems, but lets make it at least 32.
While here, make free taskqs configurable, similar to read and
write ones.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17903
Change the spelling of "begining" on line 4875 to
"beginning".
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com>
Closes#17905
When a write comes in via dmu_sync_late_arrival, its txg is equal to the
open TXG. If that write gangs, and we have not yet activated the new
gang header feature, and the gang header we pick can store a larger gang
header, we will try to schedule the upgrade for the open TXG + 1. In
debug mode, this causes an assertion to trip. This PR sets the TXG for
activating the feature to be the larger of either the current open TXG
or the syncing TXG + 1.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17824