It is not right, but there are a few places where a TX is aborted
after being assigned when an error occurs. To handle this better on
production systems, add extra cleanup steps. While here, replace a
couple of dmu_tx_abort() calls in simple cases.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17438
Having high-refcount dedup entries for zero blocks is inefficient
when they could be recorded as holes instead. Normally, zero
compression is not done if compression is disabled, so as not to
confuse naive benchmarks. But with dedup enabled, the write is
expected to be skipped anyway, so we are just optimizing the way it
is skipped.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17435
The `scn_min_txg` can now be used not only with resilver. Instead
of checking `scn_min_txg` to determine whether it's a resilver or
a scrub, simply check which function is defined. Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #17432
* zfs_link: allow tempfile sync to fail if pool suspends
4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_link() can fall back to
txg_wait_synced() if it has to wait for a tempfile to be fully created
before continuing, which will block if the pool suspends.
Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.
* zfs_clone_range: allow dirty wait to fail if pool suspends
4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_clone_range() can fall back to
txg_wait_synced() if it has to wait for a dirty block to be written out,
which will block if the pool suspends.
Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.
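A minimal sketch of the pattern, assuming the txg_wait_synced_flags()
interface from #17355 with a hypothetical TXG_WAIT_SUSPEND flag; the
exact names and the error value returned on suspension may differ:

    /* Wait for the txg to sync, but return an error instead of
     * blocking forever if the pool suspends meanwhile. */
    error = txg_wait_synced_flags(dmu_objset_pool(zfsvfs->z_os),
        txg, TXG_WAIT_SUSPEND);
    if (error != 0) {
        /* Pool suspended under failmode=continue. */
        return (SET_ERROR(EIO));
    }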
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17413
It makes no sense to limit the read size below the block size, since
the DMU will consume resources for the whole block anyway, while the
current zfs_vnops_read_chunk_size is only 1MB, which is smaller than
the maximum block size of 16MB. Plus, in case of misaligned Uncached
I/O, the buffer may get evicted between the chunks, requiring
repeated I/Os.
On 64-bit platforms, increase zfs_vnops_read_chunk_size to 32MB. This
makes us less dependent on the speculative prefetcher when an
application requests a specific size: first by not waiting for the
prefetcher to start, and later by not prefetching more than needed.
Also, while there, we don't need to align reads to the chunk size,
only to the block size, which is smaller and so more forgiving.
My profiles show ~4% CPU time savings when reading 16MB blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17415
With an increasing number of metaslab classes, it can be helpful for
debugging to know what we are looking at.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17409
Before this change, in case of any allocation error ZFS always fell
back to the normal class. But with more different classes available
we might want more sophisticated logic. For example, it makes sense
to fall back from dedup first to the special class (if it is allowed
to put DDT there) and only then to normal, since in a pool with both
dedup and special classes populated, the normal class likely has
performance characteristics unsuitable for dedup.
This change implements a general mechanism where the fallback order is
controlled by the same spa_preferred_class() as the initial class
selection. As a first application, it implements the mentioned
dedup->special->normal fallback. I have more plans for it later.
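As an illustrative sketch (not the actual allocator code: the
metaslab_alloc() arguments are abbreviated and spa_next_class() is a
hypothetical helper), the retry loop walks the fallback chain until an
allocation succeeds or the normal class is reached:

    metaslab_class_t *mc = spa_preferred_class(spa, zio);
    for (;;) {
        error = metaslab_alloc(spa, mc, /* ... */);
        if (error == 0 || mc == spa_normal_class(spa))
            break;
        /* Fall back, e.g. dedup -> special -> normal. */
        mc = spa_next_class(spa, mc);   /* hypothetical */
    }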
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17391
The module parameter name was not changed in the FreeBSD sysctl
list: 'vfs.zfs.vol.mode'. Also, on the Linux side the name is:
/sys/module/zfs/parameters/zvol_volmode.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17386
The module parameter is now represented in the FreeBSD sysctl list
with the name 'vfs.zfs.vol.prefetch_bytes'. The default value is
131072, the same as on the Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17385
The child locking difference is simple enough to handle with a boolean.
The actual work is more involved, and it's easy to forget to change
things in both places when experimenting. Just collapse them.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17382
Rewrite is a one-time/rare bulk administrative operation, which
should minimally affect payload caching. Plus, avoiding some memory
copies in its data path significantly increases its speed. My tests
show the time to rewrite 28GB of uncompressed files on an NVMe pool
dropping from 17 to 9 seconds, with minimal ARC usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17407
Make the 'zvol_threads', 'zvol_num_taskqs' and 'zvol_request_sync'
names compatible with the FreeBSD sysctl naming convention. Now the
sysctls have the following form:
$ sysctl vfs.zfs.vol.threads
vfs.zfs.vol.threads: 0
$ sysctl vfs.zfs.vol.num_taskqs
vfs.zfs.vol.num_taskqs: 0
$ sysctl vfs.zfs.vol.request_sync
vfs.zfs.vol.request_sync: 0
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17406
I've noticed that after some dedup tests, a system reboot ends up in
an assertion about the ms_defer tree not being free. It seems to be
caused by DDT flushing still freeing some blocks while ZFS is trying
to reach a final steady state, because spa_final_txg, while set by
spa_export_common() on pool export, is not set when spa_unload() is
called by spa_evict_all() on system reboot/shutdown. Setting
spa_final_txg in spa_unload() fixes this issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17395
This patch fixes a race where vdev_remove_wanted may be set after
probe initiation, which could otherwise trigger a redundant fault and
removal.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17400
Usually the I/O type can be inferred from the other fields (in
particular, priority and flags), but sometimes it's not easy to see.
This is just another little debug helper.
May 27 2025 00:54:54.024110493 ereport.fs.zfs.data
class = "ereport.fs.zfs.data"
ena = 0x1f5ecfae600801
...
zio_delta = 0x0
zio_type = 0x2 [WRITE]
zio_priority = 0x3 [ASYNC_WRITE]
zio_objset = 0x0
Document zio_type and zio_priority.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17381
The module parameter is now represented in the FreeBSD sysctl list
with the name 'vfs.zfs.vol.inhibit_dev'. The default value is '0',
the same as on the Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17384
As commit 320f0c6 did for Linux, connect POSIX_FADV_WILLNEED
up to dmu_prefetch() on FreeBSD.
While there, fix portability problems in tests/functional/fadvise.
1. Instead of relying on the numerical values of POSIX_FADV_XXX macros,
accept macro names as arguments to the file_fadvise program; see the
sketch after this list. (The numbers happen to match on Linux and
FreeBSD, but future systems may vary, and it seems a little
strange/raw to count on that.)
2. For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD
VFS currently (perhaps something that should be investigated in
FreeBSD). Since on Linux we're treating SEQUENTIAL and WILLNEED the
same, it doesn't really matter which one we use, so switch the test
over to WILLNEED to exercise the new prefetch code on both OSes the
same way.
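For point 1, the helper can map macro names to local values at compile
time, along the lines of this sketch (argument plumbing omitted):

    #include <fcntl.h>
    #include <string.h>

    /* Translate a POSIX_FADV_* macro name into this system's
     * value, so the test never depends on the raw numbers
     * matching across OSes. */
    static int
    advice_from_name(const char *name)
    {
        if (strcmp(name, "POSIX_FADV_WILLNEED") == 0)
            return (POSIX_FADV_WILLNEED);
        if (strcmp(name, "POSIX_FADV_SEQUENTIAL") == 0)
            return (POSIX_FADV_SEQUENTIAL);
        if (strcmp(name, "POSIX_FADV_DONTNEED") == 0)
            return (POSIX_FADV_DONTNEED);
        return (-1);
    }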
Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Thomas Munro <tmunro@FreeBSD.org>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #17379
Three occurrences with an 'e', and all of them mine. Maybe it's a
British thing?
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
If a variable is only available in the kernel, then the tunable should
also only be available there.
This matters very little so long as we don't have userspace tunables,
but it's still good hygiene.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
It actually doesn't matter if it's not initialised when we first
query the current value; it just returns an empty string. A crash is
quite obnoxious even if it is a rare case.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
Likely it's only int64 for comparison with ssize_t, which is signed.
However, it would make no sense for it to be less than 0 or greater than
4G, so making it a regular uint will make it safe for comparison and
remove the only S64 tunable in core.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.
With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17372
This is the cheap way to keep non-user functions working after
break-on-suspend becomes default.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool
suspends while they're waiting, rather than just on the initial check
before falling back into a wait.
Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore
suspend entirely, effectively returning to the previous behaviour.
With that, it shouldn't be possible for anything with a standard
dmu_tx_assign/wait/abort loop to block under failmode=continue.
This should also be a bit tighter than the old behaviour, where a
VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool
was already suspended with failmode=continue.
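For reference, a sketch of that standard loop (holds and error
handling abbreviated; DMU_TX_NOWAIT stands in for the usual
non-blocking first attempt):

    top:
        tx = dmu_tx_create(os);
        dmu_tx_hold_write(tx, object, off, len);
        error = dmu_tx_assign(tx, DMU_TX_NOWAIT);
        if (error != 0) {
            if (error == ERESTART) {
                /* Throttled: wait for the next txg and retry.
                 * With this change the wait also breaks out,
                 * with an error, if the pool suspends. */
                dmu_tx_wait(tx);
                dmu_tx_abort(tx);
                goto top;
            }
            dmu_tx_abort(tx);
            return (error);
        }
        /* ... dirty the held buffers ... */
        dmu_tx_commit(tx);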
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
Mostly for a little more type checking and debugging visibility.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
This allows a caller to request a wait for txg sync, with an appropriate
error return if the pool is suspended or becomes suspended during the
wait.
To support this, txg_wait_kick() is added to signal the sync condvar,
which wakes up the waiters, causing them to loop and reconsider their
wait conditions again. zio_suspend() now calls this to trigger the break
if the pool suspends while waiting.
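A sketch of the kick itself, using the existing sync condvar fields in
tx_state_t (details may differ from the final code):

    /* Wake all txg sync waiters so they loop around and re-check
     * their wait conditions, e.g. notice the pool suspended. */
    void
    txg_wait_kick(dsl_pool_t *dp)
    {
        tx_state_t *tx = &dp->dp_tx;
        mutex_enter(&tx->tx_sync_lock);
        cv_broadcast(&tx->tx_sync_done_cv);
        mutex_exit(&tx->tx_sync_lock);
    }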
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
It was reported that channel programs' zfs.get_prop doesn't work for
the dataset properties encryption and encryptionroot.
They are handled in get_special_prop due to the need to call
dsl_dataset_crypt_stats to load those dsl props.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Graham Christensen <graham@grahamc.com>
Closes #17280
It's been many years, we can probably do without.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17376
In truenas_pylibzfs, we query the list of encrypted datasets several
times, which is expensive. This commit exposes a public API,
zfs_is_encrypted(), to get the encryption status from the fast stat
path without having to refresh the properties.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
Before this change, the write log size TXG throttling mechanism
accounted only for user payload bytes. But the actual ZIL, both on
disk and especially in memory, includes headers of hundred(s) of
bytes. Not accounting for those may allow applications like bonnie++,
in their wisdom writing one byte at a time, to consume an excessive
amount of memory and ZIL/SLOG in one TXG.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17373
Without this fix, zfs_range_tree_find_in could return an overlap when
the found range starts immediately after the searched range, with no
actual overlap.
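The fix is the classic half-open interval test; a self-contained
sketch of the correct predicate:

    /*
     * Half-open ranges [s1, e1) and [s2, e2) overlap only if each
     * starts strictly before the other ends.  A found range that
     * starts exactly at the searched range's end is adjacent, not
     * overlapping -- the case the old code got wrong.
     */
    static boolean_t
    ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
    {
        return (s1 < e2 && s2 < e1);
    }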
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17363
We don't really need to access the space map to know where the
metaslab ends, while msp->ms_sm might be NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes #17164
Fixes #17359
Closes #17361
This was caught when doing a manual check to see if #17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in #17340.
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes #17353
Bisecting identified redacted send/receive as the source of the bug
in issue #12014. Specifically, the call to
dsl_dataset_hold_obj(&fromds) was replaced by
dsl_dataset_hold_obj_flags(), which passes a DECRYPT flag and creates
a key mapping. A subsequent dsl_dataset_rele_flags(&fromds) is
missing, so the key mapping is not cleared. It may then be
inadvertently used, which results in arc_untransform failing with
ECKSUM in:
arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
dmu_send_obj+0x265/0x360 [zfs]
zfs_ioc_send+0x10c/0x280 [zfs]
Fix this by restoring the call to dsl_dataset_hold_obj().
The same applies to to_ds: here, replace dsl_dataset_rele(&to_ds)
with dsl_dataset_rele_flags().
Both leaked key mappings will cause a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.
Contributions-by: Hank Barta <hbarta@gmail.com>
Contributions-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12014
Closes #17340
UIO_DIRECT means we can do Direct I/O, while DMU_DIRECTIO means we
want to do it. The first does not automatically imply the second.
Add a few checks to avoid using Direct I/O in the few cases where we
don't want it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17342
Loss of one indirect block of the meta dnode likely means loss of the
whole dataset. That is worse than the one file the man page promises,
and in my opinion is not much better than the "none" mode. This
change restores the redundancy of the meta-dnode indirect blocks,
while at the same time correcting the expectations set in the man
page.
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17339
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.
There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dsl_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero. However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.
The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
facility to the common ioctl code that allows each command to specify
whether or not success happened, regardless of the return values. I am
open to feedback on which option people think would be better.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17301
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.
This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.
A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.
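Conceptually it looks like this sketch (illustrative only: evict_cb()
and evict_arg_t are hypothetical stand-ins, not the actual arc_evict
code):

    /* Farm the eviction work out to the taskq threads, then
     * wait and accumulate their per-thread results. */
    uint64_t total = 0;
    for (int i = 0; i < nthreads; i++)
        taskq_dispatch(arc_evict_taskq, evict_cb, &args[i],
            TQ_SLEEP);
    taskq_wait(arc_evict_taskq);
    for (int i = 0; i < nthreads; i++)
        total += args[i].bytes_evicted;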
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Closes #16486
ARC target size might drop significantly under memory pressure,
especially if the current ARC size was much smaller than the target.
Since the dbuf cache size is a fraction of the target ARC size, it
might need eviction too. Aside from the memory freed by the dbuf
eviction itself, it might help the ARC by making more buffers
evictable.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17314
Before Direct I/O was implemented, I had implemented a lighter version
I called Uncached I/O. It uses the normal DMU/ARC data path with some
optimizations, but evicts data from the caches as soon as possible and
reasonable. Originally I wired it only to the primarycache property,
but this now completes the integration all the way up to the VFS.
While Direct I/O has the lowest possible memory bandwidth usage, it
also has a significant number of limitations. It requires I/Os to be
page aligned, does not allow speculative prefetch, etc. Uncached I/O
does not have those limitations, but instead requires an additional
memory copy, though still one less than regular cached I/O. As such,
it should fill the gap between the two. Considering this, I've
disabled the annoying EINVAL errors on misaligned requests, adding a
tunable for those who want to test their applications.
To pass the information between the layers I had to change a number of
APIs. But as a side effect, upper layers can now control not only the
caching, but also speculative prefetch. I haven't wired it to the VFS
yet, since that requires looking at some OS specifics. But while
there, I've implemented speculative prefetch of indirect blocks for
Direct I/O, controllable via all the same mechanisms.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes #17027
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Nothing modifies them, and nothing should, so let's try to enforce that.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
This allows rewriting the content of specified file(s) as-is, without
modification, but at a different location, and with different
compression, checksum, dedup, copies and other parameter values. It
is faster than read plus write, since it does not require copying data
to user-space. It is also faster for sync=always datasets, since
without data modification it does not require ZIL writing. Also,
since it is protected by normal range locks, it can be done under any
other load. It also does not affect the file's modification time or
other properties.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
db.db_mtx must be held any time that db.db_data is accessed. All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.
See https://github.com/openzfs/zfs/discussions/17118
* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.
* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.
* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.
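The ordering that makes the lock unnecessary, sketched (simplified;
buffer allocation details differ in the real code):

    /* Fill the new buffer completely before attaching it to the
     * dbuf, so no other thread can observe partial contents and
     * db_rwlock is not needed for the copy. */
    arc_buf_t *buf = arc_alloc_buf(spa, db, type, size);
    memcpy(buf->b_data, src, size);
    dbuf_set_data(db, buf);     /* publish only when complete */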
Sponsored by: ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17209
Make zvol I/O request processing asynchronous on the FreeBSD side in
some cases. Clone the zvol threading logic and required module
parameters from the Linux side. Make the zvol threadpool
creation/destruction logic shared between Linux and FreeBSD.
The I/O requests are processed asynchronously in the following cases:
- volmode=geom: if the I/O request thread is the geom thread or cannot
sleep.
- volmode=cdev: if the I/O request was passed through the struct
cdevsw .d_strategy routine, meaning it is an AIO request.
In all other cases the I/O requests are processed synchronously. The
volthreading zvol property is ignored on the FreeBSD side.
Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17169
It's been dead ever since 5fa356ea44
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17119
With certain combinations of target ARC state balance and ghost hit
rates, it was possible for the fractions to go outside the allowed
range. This patch limits the maximum balance adjustment speed, which
should make that impossible, and also asserts it.
Fixes #17210
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.
We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.
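The payoff is fewer leaves whenever the free segments allow it; a
self-contained sketch of the arithmetic (illustrative, not the
allocator itself):

    /* With fixed thirds, a block always gangs into 3 leaves.
     * Allocating a range of sizes needs only as many leaves as
     * the largest allocatable segment forces. */
    static uint64_t
    gang_leaves_needed(uint64_t asize, uint64_t max_seg)
    {
        return ((asize + max_seg - 1) / max_seg);   /* ceiling */
    }
    /* e.g. 128K with 96K segments available: 2 leaves, not 3. */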
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.
This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.
Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.
The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).
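Callers then branch on the returned code; a sketch (assuming EINTR is
returned for the signal case):

    error = txg_wait_synced_flags(dp, txg, TXG_WAIT_SIGNAL);
    if (error == EINTR) {
        /* Interrupted by a signal before the txg synced. */
        return (error);
    }
    /* error == 0: the txg is synced. */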
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>