These are only required to support these ioctls on Linux <4.5. Since
4.18 is our cutoff, we don't need this code anymore.
Also removing related test things that will never match again.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17308
It makes no sense to limit read size below the block size, since
DMU will any way consume resources for the whole block, while the
current zfs_vnops_read_chunk_size is only 1MB, which is smaller
that maximum block size of 16MB. Plus in case of misaligned
Uncached I/O the buffer may get evicted between the chunks,
requiring repeating I/Os.
On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB.
It allows to less depend on speculative prefetcher if application
requests specific size, first not waiting for prefetcher to start
and later not prefetching more than needed.
Also while there, we don't need to align reads to the chunk size,
but only to a block size, which is smaller and so more forgiving.
My profiles show ~4% of CPU time saving when reading 16MB blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17415
It's been many years, we can probably do without.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17376
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.
This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.
A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Closes#16486
This also includes removing L2 vdevs asynchronously.
This commit also guarantees that spa_load_guid is unique.
The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.
However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools. Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16215
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.
So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17445
If a write is split across mutliple itxs, we only want the callback on
the last one, otherwise it will be called for every itx associated with
this single write, which makes it very hard to know what to clean up.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17445
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17420
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17420
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17420
Previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free. While they might be from some points, cloning
is not net-free. It will likely consume space and memory, and
unlike free it will do it no matter whether the destination has
the blocks or not (usually not, so previous code did nothing).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17431
Looking on txg_wait_synced(, 0) I've noticed that it always syncs
5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE. But in case
of dmu_offset_next() we do not care about deferred frees. And even
concurrent TXGs we might not need sync all 3 if the dnode was not
dirtied in last few TXGs.
This patch makes dmu_offset_next() to sync one TXG at a time until
the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times.
My tests with random simultaneous writes and seeks over many files
on HDD pool show 7-14% performance increase.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17434
It is not right, but there are few examples when TX is aborted
after being assigned in case of error. To handle it better on
production systems add extra cleanup steps.
While here, replace couple dmu_tx_abort() in simple cases.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17438
Having high-refcount dedup entries for zero blocks is inefficient
when they could be recorded as a holes instead. Normally, zero
compression is not done if compression is disabled to not confuse
naive benchmarks. But with dedup enabled, it is expected that the
write will be skipped anyway, so we are just optimizing the way it
is skipped.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17435
After #17401 the Linux build produces some stack related warnings.
Silence them with the `STACK_FRAME_NON_STANDARD` macro.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17410
I've noticed that after some dedup tests system reboot ends up in
assertion about ms_defer tree not free. It seems to be caused by
DDT flushing still freeing some blocks while ZFS trying to reach
a final steady state due to spa_final_txg, while being set by
spa_export_common() on pool export, is not set when spa_unload()
is called by spa_evict_all() on system reboot/shutdown. Setting
spa_final_txg in spa_unload() fixes this issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17395
This patch fixes a race where vdev_remove_wanted may be set after probe
initiation, which could otherwise trigger redundant fault and removal.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17400
If the kernel fails to allocate the gendisk, zvo_disk will be NULL, and
derefencing it will explode. So don't do that.
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17396
We silence `objtool` warnings on some object files using
`OBJECT_FILES_NON_STANDARD_some_file.o`. Nowadays `objtool` is
needed for CPU vulnerability mitigations and a lot more
functionality so its use is desirable.
Just remove the `OBJECT_FILES_NON_STANDARD` definitions. A follow-up
commit is needed to make the offending files standard and address
the compile time warnings.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#17401Closes#17364
Three occurences with an 'e', and all of them mine. Maybe it's an
British thing?
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
Since 3.17 Linux has provided param ops for 64-bit ints, so we don't
need to use our own anymore.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
Nothing uses them now.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
The use for spl_taskq_kick was the only use, and the comment that
module_param_call is obsolete is no longer true - it's still very much
used even in recent kernels.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
Nothing in any FreeBSD code uses them.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
If a variable is only available in the kernel, then the tunable should
also only be available there.
This matters very little so long as we don't have userspace tunables,
but its still good hygeine.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
It actually doesn't matter if it's not initialised when we first query
the current value; it just returns empty-string. A crash is quite
obnoxious even if it is a rare case.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
Likely it's only int64 for comparison with ssize_t, which is signed.
However, it would make no sense for it to be less than 0 or greater than
4G, so making it a regular uint will make it safe for comparison and
remove the only S64 tunable in core.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17377
failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.
With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17372
It was reported that channel programs' zfs.get_prop doesn't work for
dataset properties encryption and encryptionroot.
They are handled in get_special_prop due to the need to call
dsl_dataset_crypt_stats to load those dsl props.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Graham Christensen <graham@grahamc.com>
Closes#17280
In truenas_pylibzfs, we query list of encrypted datasets several times,
which is expensive. This commit exposes a public API zfs_is_encrypted()
to get encryption status from fast stat path without having to refresh
the properties.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17368
Before this change write log size TXG throttling mechanism was
accounting only user payload bytes. But the actual ZIL both on
disk and especially in memory include headers of hundred(s) of
bytes. Not accouting those may allow applications like
bonnie++, in their wisdom writing one byte at a time, to consume
excessive amount of memory and ZIL/SLOG in one TXG.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17373
Without this fix, zfs_range_tree_find_in could return an overlap when
the found range starts immediately after the searched range, with no
actual overlap.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17363
We don't really need to access space map to know where the metaslab
ends, while msp->ms_sm might be NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes#17164Fixes#17359Closes#17361
(cherry picked from commit 5c30b24381)
This was caught when doing a manual check to see if #17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in #17340.
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes#17353
(cherry picked from commit 83fa80a550)
Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or
functional change apart from that, so a very minimal update for us.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17229
(cherry picked from commit 841be1d049)
The intent is that the filesystem may have a reference to an "old"
version of the new directory, eg if it was keeping it alive because a
remote NFS client still had it open.
We don't need anything like that, so this really just changes things so
we return error codes encoded in pointers.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17229
(cherry picked from commit bb740d66de)
d634d20d1b had been intended to fix a
potential information leak issue where the compiler's optimization
passes appeared to remove `memset()` operations that sanitize sensitive
data before memory is freed for use by the rest of the kernel.
When I wrote it, I had assumed that the compiler would not remove the
other `memset()` operations, but upon reflection, I have realized that
this was a bad assumption to make. I would rather have a very slight
amount of additional overhead when calling `gcm_clear_ctx()` than risk a
future compiler remove `memset()` calls. This is likely to happen if
someone decides to try doing link time optimization and the person will
not think to audit the assembly output for issues like this, so it is
best to preempt the possibility before it happens.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes#17343
(cherry picked from commit d8a33bc0a5)
Bisecting identified the redacted send/receive as the source of the bug
for issue #12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) has been replaced by
dsl_dataset_hold_obj_flags() which passes a DECRYPT flag and creates
a key mapping. A subsequent dsl_dataset_rele_flag(&fromds) is missing
and the key mapping is not cleared. This may be inadvertedly used, which
results in arc_untransform failing with ECKSUM in:
arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
dmu_send_obj+0x265/0x360 [zfs]
zfs_ioc_send+0x10c/0x280 [zfs]
Fix this by restoring the call to dsl_dataset_hold_obj().
The same applies for to_ds: here replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags().
Both leaked key mappings will cause a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.
Contributions-by: Hank Barta <hbarta@gmail.com>
Contributions-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12014Closes#17340
(cherry picked from commit ea74cdedda)
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.
There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dmu_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero. However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.
The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
facility to the common ioctl code that allows each command to specify
whether or not success happened, regardless of the return values. I am
open to feedback on which option people think would be better.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17301
(cherry picked from commit 086105f4c4)
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too. Aside of memory from the dbuf eviction
itself, it might help ARC by making more buffers evictable.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17314
(cherry picked from commit 89a8a91582)
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17298
(cherry picked from commit 2ee5b51a57)
db.db_mtx must be held any time that db.db_data is accessed. All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.
See https://github.com/openzfs/zfs/discussions/17118
* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.
* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.
* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.
Sponsored by: ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#17209
(cherry picked from commit c17bdc4914)
SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9):
Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in
branch 'main'. It will soon be backported to 'stable/14'. We will thus
be able to remove the old, alternate version left in the '#else' branch
as soon as 'stable/13' goes out of support (April 30, 2026).
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes#17309
(cherry picked from commit 78628a5c15)
With certain combinations of target ARC states balance and ghost
hit rates it was possible to get the fractions outside of allowed
range. This patch limits maximum balance adjustment speed, which
should make it impossible, and also asserts it.
Fixes#17210
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit b1ccab1721)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.
This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.
Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.
The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit a7de203c86)
We should not clear scn_state and notify waiters until we call
vdev_dtl_reassess(), otherwise following offline/detach request
may fail with "no valid replicas".
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
(cherry picked from commit f86d9af16b)
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk. This commit checks for that condition and disallows it.
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #16547Closes: #17231
(cherry picked from commit f40ab9e399)
### Background
Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.
Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.
However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.
### Current implementation
The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.
For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.
### Changes
We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.
On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit c8fa39b46c)