Added vdev property to disable the vdev scheduler.
The intention behind this property is to improve IOPS
performance when using o_direct.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes#17358
Break out the range_tree, btree, and highbit64/lowbit64 code from kernel
space into shared kernel and userspace code. This is needed for the
updated `zpool status -vv` error byte range reporting that will be
coming in a future commit. That commit needs the range_tree code in
kernel and userspace.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18133
The upcoming 7.0 kernel will no longer fall back to generic_setlease(),
instead returning EINVAL if .setlease is NULL. So, we set it explicitly.
To ensure that we catch any future kernel change, adds a sanity test for
F_SETLEASE and F_GETLEASE too. Since this is a Linux-specific test,
also a small adjustment to the test runner to allow OS-specific helper
programs.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18215
There's no real need to outright crash if key loading fails; we can
just unwind nicely.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18230
Previously using -K/--key on an unencrypted dataset would trip a VERIFY,
because the dataset has nowhere to load the key into.
Now, just ignore it. This makes zdb much easier to drive when there's a
mix of encrypt and non-encrypted datasets, as the key can provided for
all of them (at least, assuming the same encryption root, which is a
common enough case).
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18230
Because it's called in $(...), it will swallow all errors, so we have to
work harder to recognise falure and echo a string that can't ever match
what the test is expecting.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18230
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18232
If a CPU has variable frequency, then lscpu will list separate "CPU min
freq" and "CPU max freq" values. In this case, take the maximum.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18232
Currently, spa_dspace (base to calculate dataset AVAIL) only includes
the normal allocation class capacity, but dd_used_bytes tracks space
allocated across all classes. Since we don't want to report free
space of other classes as available (we can't promise new allocations
will be able to use it), report only allocated space, similar to how
we report space saved by dedup and block cloning.
Since we need deflated space here, make allocation classes track
deflated allocated space also. While here, make mc_deferred also
deflated, matching its use contexts. Also while there, use
atomic_load() to read the allocation class stats.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18190Closes#18222
ZFS can be built directly into the Linux kernel. Add a test build
of this to the CI to verify it works. The test build is only enabled
on Fedora runners (since they run the newest kernels) and is done in
parallel with ZTS. The test build is done on vm2, since it typically
finishes ~15min before vm1 and thus has time to spare.
In addition:
- Update 'copy-builtin' to check that $1 is a directory
- Fix some VERIFYs that were causing the built-in build to fail
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18234
In the AES-GCM assembly files we are defining Gmul functions we
don't use anywhere.
Just remove the dead code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#18226
I am not sure why it was added there 10 years ago, but it seems not
needed now. According to my tests removing it improves sequential
read performance with recordsize=4K by 5-10% by reducing the CPU
overhead in prefetcher.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18214
Require AVX2 compiler support and document source files for
`aesni-gcm-avx2-vaes.S`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#18225
Once upon a time, 32-bit PowerPC did indeed have a 32-bit time_t, but
FreeBSD 12.0 switched to a 64-bit time_t for PowerPC as an ABI break,
which predates the addition of FreeBSD support to OpenZFS. Moreover,
64-bit PowerPC has existed since FreeBSD 9.0, where __powerpc__ is also
defined (alongside __powerpc64__ to disambiguate), which has always had
a 64-bit time_t. This code has therefore always been wrong for all
PowerPC variants. Fix this by limiting the 32-bit case to just i386,
which is the only architecture in FreeBSD to have a 32-bit time_t and
not have broken ABI, due to its special legacy compatibility status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Jessica Clarke <jrtc27@jrtc27.com>
Closes#18217Closes#18218
Linux 6.19 added an AES-GCM VAES-AVX2 assembly implementation. It's
basically a translation from the BoringSSL perlasm syntax to macro
assembly. We're using the same source but the perlasm generated flat
assembly which shares some global function names with the former.
When building in-tree this results in the linker failing due to the
duplicate symbols.
To avoid the error we prepend `icp_` via a macro to our function
names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Moch <mail@alexmoch.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#18204Closes#18224
We should not dereference through dn_handle->dnh_dnode once we
already have a dnode pointer. The result will be the same.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18212
- For multilevel gang blocks it seemed possible to fallback from
normal to special class, since they don't have proper object type,
and DMU_OT_NONE is a "metadata". They should never fallback.
- Fix possible inversion with zfs_user_indirect_is_special = 0,
when indirects written to normal vdev, while small data to special.
Make small indirect blocks also follow special_small_blocks there.
- With special_small_blocks now applying to both files and ZVOLs,
make it apply to all non-metadata without extra checks, since there
are no other non-metadata types.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18208
This change modifies the behavior of spa_sync_time_logger when
flushing the RRD database.
Previously, once the sync interval elapsed, a flush would always
be generated. On solid-state devices, especially when the pool was
otherwise idle, this caused disks to wake up solely to write RRD
data. Since RRD is best-effort telemetry, this behavior is
unnecessary and wasteful.
With this change, spa_sync_time_logger delays flushing until a TXG
that already contains data is being synced. The RRD update is
appended to that TXG instead of forcing the creation of
a new write-only TXG.
During pool export, flushing is forced regardless of whether
the TXG contains user data. At that stage, data durability takes
precedence and a write must be issued.
Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes#18082Closes#18138
When performing an incremental raw send with intermediates (-w -I),
the standard 'send' permission was incorrectly required instead of
allowing 'send:raw'. This was due to a strict boolean comparison on
the 'rawok' flag in zfs_secpolicy_send() with non-boolean value.
This change normalizes the 'rawok' variable to be strictly 0/1 and
updates the test suite to properly verify delegated raw send behavior.
Introduced-by: https://github.com/openzfs/zfs/pull/17543
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Marc Sladek <marc@sladek.dev>
Closes#18198Closes#18193
Wait for scrub_finish (as the comments in the code suggest) rather
than trim_finish in zed_synchronous_zedlet.ksh. This seems to
workaround the ZTS failures in #18192. Also, fix some typos.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18192Closes#18196
Update the META file to reflect compatibility with the 6.19
kernel.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18197
When zpool create fails because a vdev cannot be opened (ENXIO),
the error falls through to zpool_standard_error() which reports
the generic 'one or more devices is currently unavailable'. This
is misleading when the real cause is a block size mismatch or
other device open failure.
Add an explicit ENXIO case in zpool_create()'s error handling to
provide a more descriptive message.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes#18184Closes#11087
The Lustre filessytem calls a number of exported ZFS functions. Do a
test build on the Almalinux runners to make sure we're not breaking
Lustre. We do the Lustre build in parallel with the normal ZTS test
for efficiency, since ZTS isn't very CPU intensive. The full Lustre
build takes around 15min when run on its own.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18161
While technically its not a problem to clone between datasets with
different properties, it might create expectation of new properties
being applied during data move, while actually it won't happen.
For copies and checksum it may mean incorrect safety expectations.
For dedup, compression and special_small_blocks -- performance and
space usage. New zfs_bclone_strict_properties tunable controls it.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18180
Without this patch, the following crash can occur when
a file system is configured with "xattr=dir".
VNASSERT failed: locked not true at
/posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked)
hold count flags ()
flags ()
lock type zfs: UNLOCKED
panic: zfs_dirent_lookup: vnode is not locked but should be
cpuid = 3
time = 1770520763
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b
vpanic() at vpanic+0x136/frame 0xfffffe00914c8270
panic() at panic+0x43/frame 0xfffffe00914c82d0
assert_vop_locked() at assert_vop_locked+0x78
zfs_dirent_lookup() at zfs_dirent_lookup+0x41
zfs_setattr_dir() at zfs_setattr_dir+0x123
zfs_setattr() at zfs_setattr+0x1389
zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b
VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d
setfown() at setfown+0xb1
kern_fchownat() at kern_fchownat+0x192
This patch fixes the problem by moving the vput() call for
attrzp to after the zfs_setattr_dir() call that takes it as
an argument.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes: #18188
Because the `strerror` result doesn't include a newline, we need to add
one. Observed on a minimal system that doesn't have `man` installed,
which behaves like this before the fix:
```
[root@upper tim]# zpool help import
couldn't run man program: No such file or directory[root@upper tim]#
```
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Hatch <tim@timhatch.com>
Closes#18183
Rewrite of cloned and snapshotted blocks can allocate additional
space, that may be undesired. In some cases it may have sense
to still rewrite snapshotted blocks, expecting the snapshots to
rotate with time, freeing space. In other cases rewrite of cloned
blocks may be acceptable, despite persistent space usage increase.
For this reason add them as separate flags to `zfs rewrite`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18179
"Welcome to my house! Enter freely. Go safely, and leave something of
the happiness you bring!"
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#18189
- mmp_concurrent_import: added test case to verify that concurrent
import correctness. The pool may only be imported once.
- mmp_exported_import: an activity check is now required for pools
which were cleanly exported if the system and pool hostids don't
match.
- mmp_inactive_import: an activity check is now required for any
pool which wasn't cleanly exported, even if the system and pool
hostids match.
- mmp_on_uberblocks: updated expected uberblocks to take in to account
the value MMP_INTERVAL_DEFAULT is set too.
- mmp_reset_interval: reduce the number of iterations from 10 to 3.
This is sufficient to verify functionality and significantly speeds
up the test.
- mmp_on_uberblocks: adjust the thresholds and increase the runtime
to avoid false positives observed in CI.
- Update tests to use 'zhack action idle' instead of ztest to improve
the reliability of the tests.
- Add additional log_note messages to test cases which have multiple
verification steps to make it clear which portion of a test failed
when reviewing the logs.
- Replace default_setup/cleanup_noexit calls with 'zpool create' and
'zpool destroy' calls to avoid additional unnecessary dataset
creation work.
- Update activity/noactivity check helper functions to use the
ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
to determine if this activity check ran and why. This is more
reliable in the CI than measuring the runtime.
- Removed all mmp tests from the zts-report.py exceptions list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
In order to reliably test the multihost protection we need two (or more)
systems attempting to import the pool at the same time. Historically, we've
used ztest running in userspace to simulate an active pool and attempted to
import the pool with the kernel modules. This works but ztest is a bit
unwieldy for this and if it crashes for unrelated reasons it can result
in false positives.
All we really need is the pool imported in userspace so the MMP thread is
active and writing out uberblocks. We can extend zhack which already knows
how to import the pool read/write and add an option to leave the pool open
and idle.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Add a -G option to zhack to dump the internal debug buffer on exit.
We were able to use the same code from zdb for this which was nice.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts. This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect. This extra
safety chceck operates as follows:
1. Repeats the following MMP check 10 times:
a. Write out an MMP uberblock with the best txg and a random
sequence id to all primary pool vdevs.
b. Verify a minimum number of good writes such that even if
the pool appears degraded on the remote host it will see
at least one of the updated MMP uberblocks.
c. Wait for the MMP interval this leaves a window for other
racing hosts to make similar modifications which can be
detected.
d. Call vdev_uberblock_load() to determine the best uberblock
to use, this should be the MMP uberblock just written.
e. Verify the txg and random sequeunce number match the MMP
uberblock written in 1a.
2. Restore the original MMP uberblocks. This allows the check
to be performed again if the pool fails to import for an
unrelated reason.
This change also includes some refactoring and minor improvements.
- Never try loading earlier txgs during import when the import
fails with EREMOTEIO or EINTER. These errors don't indicate
the txg is damaged but instead that its either in use on a
remote host or the import was interactively cancelled. No
rewind is also performed for EBADD which can result from a
stale trusted config when doing a verbatim import.
- Refactor the code for consistent logging of the multihost
activity check using spa_load_note() and console messages
indicating when the activity check was trigger and the result.
- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
modification of the sequence number in an uberblock.
- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
set to log to dump to stdout the spa_load_info nvlist returned
during import. This is used by the updated mmp test cases
to determine if an activity check was run and its result.
- Standardize the mmp messages similarly to make it easier to
find all the relevent mmp lines in the debug log.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Tryimport adds a unique prefix to the pool name to avoid name
collisions. This makes it awkward to log user-friendly info
during a tryimport. Add a spa_load_name() function which can
be used to report the unmodified pool name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Move the "Starting import" log message in to the import block so
it's matched with the "Fiinshed importing" debug message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
For a cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check. Only allow the check to be skipped when the
last imported hostid matches the systems hostid and the pool was
cleanly exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18143Closes#18144
As per #9008, pyzfs implements and documents several functions that
would be very useful, but then try to call c functions in libzfs_core.
These functions do not exist in libzfs_core, and in the ~7 years of
ticket creation still do not exist in libzfs_core.
It seems unlikely that these functions will get implemented, though 2
years ago, ~5 years after that ticket lzc_get_props was implemented in
23a489a411 which enabled get properties in
pyzfs. Sadly the first thing the pyzfs function for lzc_get_props does
is call _list, which cals lzc_list, which is not implmented. And the
functions to set or inherit properties are still missing.
Having these functions in pyzfs are misleading, footguns, and time
wasters when evaluating pyzfs.
Removing these functions from pyzfs means that _if_ these functions are
added in libzfs_core, then pyzfs will also need to re-implement these
functions. It's a shame, because these py functions have good
documentation and tests. Funny enough the tests are auto skipped if it
detects that the functions don't exist in libzfs_core.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: George Shammas <george@shamm.as>
Closes#9008Closes#18162
To avoid read errors with transaction open dmu_tx_check_ioerr()
is used to read everything required in advance. But there seems
to be a chance for the buffer to evicted from dbuf cache in
between, which result in immediate eviction from ARC, which may
require additional disk read later in a place where error handling
is problematic.
To partially workaround this introduce a new flag DMU_IS_PREFETCH,
relayed to ARC as ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH,
making ARC delay eviction by at least several seconds, or till the
actual read inside the transaction, that will promote it to demand
access.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18160
These fields became unused when ABD was introduced in a6255b7fc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Add l2arc_dwpd_limit, remove l2arc_write_boost, update related tunables.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Add DWPD (Drive Writes Per Day) rate limiting to control L2ARC write
speeds and protect SSD endurance. Write rate is constrained by the
minimum of l2arc_write_max and DWPD-calculated budget. Devices
accumulate unused write budget over 24-hour periods with automatic reset
and carry-over. Writes occur in controlled bursts (max 50MB) with
adaptive intervals to achieve target rates. Applies after initial device
fill.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
Transform L2ARC from single global feed thread to per-device threads,
enabling parallel writes to multiple L2ARC devices. Each device runs
its own feed thread independently, improving multi-device throughput.
Previously, a single thread served all devices sequentially; now each
device writes concurrently. Threads are created during device addition
and torn down on removal.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
When arc_release() is called on a header with a single buffer and
L2_WRITING set, the L2HDR must be preserved for ABD cleanup (similar
to the arc_hdr_destroy() case). If we destroy the L2HDR here, later
arc_write() will allocate a new ABD and call arc_hdr_free_abd(),
which needs b_l2hdr.b_dev to properly defer ABD cleanup, causing
VERIFY(HDR_HAS_L2HDR(hdr)) to fail.
Allocate a new header for the buffer in the single_buf_l2writing
case (single buffer + L2_WRITING), leaving the original header with
L2HDR intact. The original header becomes an "orphan" (no buffers, no
b_pabd) but retains device association for ABD cleanup when
l2arc_write_done() completes.
The shared buffer case (HDR_SHARED_DATA) is excluded because L2ARC
makes its own transformed copy via l2arc_apply_transforms(), so the
original ABD is not used by the L2 write. The header can be safely
reused without allocating a new one.
For proper evictable space accounting, arc_buf_remove() must be
called before remove_reference() in the single_buf_l2writing path.
This ensures arc_evictable_space_increment() (during remove_reference)
and arc_evictable_space_decrement() (during destruction) see the
same state (b_buf=NULL), preventing accounting leaks that cause
module unload to hang with non-zero esize.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
With multiple L2ARC devices, headers can be destroyed asynchronously
(e.g., during zpool sync) while L2_WRITING is set. The original code
destroyed L2HDR before L1HDR, causing ABDs to lose their device
association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called.
This caused ABDs to be added to the global free-on-write list without
device information. When any L2ARC device completed its write and
attempted to free these orphaned ABDs, it would panic on
ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was
still part of another device's vdev_queue I/O aggregation gang.
Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and
reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This
ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag
ABDs with their device for deferred cleanup.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
This commit introduces per-sublist persistent markers that eliminate
redundant tail scanning between L2ARC iterations, providing significant
CPU efficiency improvements. Markers are pre-allocated during device
initialization and properly cleaned up during device removal.
The implementation uses conditional behavior based on device capacity:
small devices (capacity < arc_c) retain original HEAD/TAIL scanning
based on ARC warmup state, while large devices (capacity >= arc_c)
use the persistent marker approach for optimal CPU efficiency.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
The introduction of ARC multilists made L2ARC writing quite random,
depending on whether it found something to write in a randomly selected
sublist. This created inconsistent write patterns and poor utilization
of available sublists leading to uneven cache population.
This commit replaces random selection with systematic scanning across
all sublists within each burst. Fair headroom distribution ensures
even-depth traversal across all sublists until the target write size
is reached. Round-robin processing with random starting points eliminates
sequential bias while maintaining predictable write behavior.
The systematic approach provides consistent L2ARC filling patterns
and better utilization of available ARC data across all sublists.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18093
This implemented support for having multiple datasets unlocked and
mounted when a session is opened.
Example: `homes=rpool/home,tank/users`
Extra unit tests have been added
A man page documents have been added `man 8 pam_zfs_key`. A few
references to the new man page have also been added in other documents.
Signed-off-by: Dennis Vestergaard Værum <github@varum.dk>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
The macro 'flush_dcache_page(...)' modifies the page flags, but in Linux
6.18 the type of the page flags changed from 'unsigned long' to the
struct type 'memdesc_flags_t' with a single member 'f' which is the page
flags field.
Signed-off-by: Erik Larsson <catacombae@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>