The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.
We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
spa_vdev_remove_thread() should not hold svr_lock while loading a
metaslab. It may block ZIO threads, required to handle metaslab
loading, at least in case of read errors causing recovery writes.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17145
The vd->vdev_ms access can overflow due to on-disk corruption, not just
due to programming bugs. So it makes sense to check its boundaries even
in production builds.
Sponsored by: ConnectWise
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#17136
- Don't drop L2ARC header if we have more buffers in this header.
Since we leave them the header, leave them the L2ARC header also.
Honestly we are not required to drop it even if there are no other
buffers, but then we'd need to allocate it a separate header, which
we might drop soon if the old block is really deleted. Multiple
buffers in a header likely mean active snapshots or dedup, so we
know that the block in L2ARC will remain valid. It might be rare,
but why not?
- Remove some impossible assertions and conditions.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17126
This helps to avoids confusion with the similarly-named
txg_wait_synced().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means an user having the receive permission can effectively
delete data on receiving side, even if such user does not have explicit
rollback or destroy permissions.
This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.
Fixes#16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#17015
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Don Brady <don.brady@klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17038
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup() to not by mistake free previous
pruned incarnation of the entry. But when writing a new block in
zio_ddt_write() we might have the DVAs only from override pointer,
which may never have "D" flag to be confused with pruned DDT entry,
and we'll abandon those DVAs if we find a matching entry in DDT.
This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested. And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17120
Since embedded blocks introduction 11 years ago, their writing was
blocked if dedup is enabled. After searching through the modern
code I see no reason for this restriction to exist. Same time
embedded blocks are dramatically cheaper. Even regular write of
so small blocks would likely be cheaper than deduplication, even
if the last is successful, not mentioning otherwise.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17113
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.
We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16972
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:
```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```
This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.
And in zdb, don't crash when printing a block pointer with no valid
DVAs. If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug. zdb should be able to handle the
situation gracefully.
Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR. This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why. It doesn't appear to do anything useful, so remove it.
[^1]: b24ab67627
Fixes #17077
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#17078
PR #14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors
if local MAC can not be calculated at the time. But it does not
mean we should also ignore portable MAC errors there.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17122
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17071
Don't try to get mg of hole vdev in removal
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17080
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes#17088
Before this change zfs_metaslab_switch_threshold tunable switched
metaslabs each time ones index reduced by two (which means biggest
contiguous chunk reduced to 1/4). It is a good idea to balance
metaslabs fragmentation. But for empty metaslabs (having power-
of-2 sizes) this means switching when they get just below the half
of their capacity. Inspection with zdb after filling new pool to
half capacity shown most of its metaslabs filled to half capacity.
I consider this sub-optimal for pool fragmentation in a long run.
This change blocks the metaslabs switching if most of the metaslab
free space (15/16) is represented by a single contiguous range.
Such metaslab should not be considered fragmented until it actually
fail some big allocation. More contiguous filling should improve
data locality and increase time before previously filled and
partially freed metaslab is touched again, giving it more time to
free more contiguous chunks for lower fragmentation. It should
also slightly reduce spacemap traffic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17081
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the
zio_taskq thread is active and blocked both while waiting for the IO
call and then while calling zio_execute() for the next stage. This is a
particular issue for FLUSH, as the z_flush_iss queue typically only has
one thread; multiple flushes arriving at once can cause long delays if
the underlying fsync() response is particularly slow.
To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq,
just as we do for reads and writes. Further, we return all results
through zio_interrupt(), so neither the issue nor the file taskqs are
blocked.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17064
Need to use arc_free_data_abd to free abd type buffer.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#17079
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.
The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would be normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Since we are calculating a free space fragmentation, we should
weight metaslabs by the amount of their free space, not a full
size. Fragmentation of full metaslabs may not matter in presence
empty ones. The old algorithm did not differentiate metaslabs
having only one free 4KB block from metaslabs having 50% of space
free in 4KB blocks, reporting higher fragmentation.
While there, move metaslab_group_alloc_update() call after setting
mg_fragmentation, otherwise the effect may be delayed by one TXG.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
The purpose of no-op is to simulate a failure between a device cache and
its permanent store. We still want it to go through the queue and
respond in the same way to everything else.
So, inject "success" as the very last thing, and then move on to
VDEV_IO_DONE to be dequeued and so any followup work can occur.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17029
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17003
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#16986
As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.
It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.
To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17016
This change will prevent prefetch to perform unnecessary ARC buffer
fill when reading from disk.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#17013
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which is explicitly excluded
from injection. Even if it were possible, it would be awkward to do,
because a probe is sequence of reads and writes.
This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the
injection record and recieve the wanted error.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
I'm about to add a new "type", and I need somewhere to put it!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.
So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (ie after frequency and other exclusions are applied).
Then, display those counts in the `zinject` output.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#16938
Similar to what we saw in #16569, we need to consider that a
replacing vdev should not be considered as fully contributing
to the redundancy of a raidz vdev even though current IO has
enough redundancy.
When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16864
It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the removed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.
To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac223.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16921
In Linux, block devices currently lack support for `copy_file_range`
API because the kernel does not provide the necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16901
People have noted there's no way to remove a pool userprop, only zero
it. Turns vdev userprops had a method, by setting empty-string. So this
makes pool userprops follow the same behaviour.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16887
The count of chunks in a microzap block is stored as an uint16_t
(mze_chunkid). Each chunk is 64 bytes, and the first is used to store a
header, so there are 32767 usable chunks, which is just under 2M. 1M is
the largest power-2-rounded block size under 2M, so we must set the
limit there.
If it goes higher, the loop in mzap_addent can overflow and fall into
the PANIC case.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16888
VDEV_PROP_USERPROP is equal do VDEV_PROP_INVAL and so is not a real
property. That's why vdev_prop_readonly() does not work right for
it. In particular it may declare all vdev user properties readonly
on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16890
We should not dereference rra after the last zio_nowait() is called.
It seems very unlikely, but ASAN in ztest managed to catch it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16868
It seems there's no good reason for vdev_disk & vdev_geom to explicitly
detect no support for flush and set vdev_nowritecache. Instead, just
signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess()
take care of it in one place.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
The first time a device returns ENOTSUP in repsonse to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend the succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!
So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db
except for overridden case. However, this exception causes a race where
dbuf_new_size could free the arc_buf after the last dereference of
*datap and causes use-after-free. We fix this by cloning the buf
regardless if it's overridden.
The race:
--
P0 P1
dbuf_hold_impl()
// dbuf_hold_copy passed
// because db_data_pending NULL
dbuf_sync_leaf()
// doesn't clone *datap
// *datap derefed to db_buf
dbuf_write(*datap)
dbuf_new_size()
dmu_buf_will_dirty()
dbuf_fix_old_data()
// alloc new buf for P0 dr
// but can't change *datap
arc_alloc_buf()
arc_buf_destroy()
// alloc new buf for db_buf
// and destroy old buf
dbuf_write() // continue
abd_get_from_buf(data->b_data,
arc_buf_size(data))
// use-after-free
--
Here's an example when it happens:
BUG: kernel NULL pointer dereference, address: 000000000000002e
RIP: 0010:arc_buf_size+0x1c/0x30 [zfs]
Call Trace:
dbuf_write+0x3ff/0x580 [zfs]
dbuf_sync_leaf+0x13c/0x530 [zfs]
dbuf_sync_list+0xbf/0x120 [zfs]
dnode_sync+0x3ea/0x7a0 [zfs]
sync_dnodes_task+0x71/0xa0 [zfs]
taskq_thread+0x2b8/0x4e0 [spl]
kthread+0x112/0x130
ret_from_fork+0x1f/0x30
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes#16854
When vdev first sees some block cloning, there is a window when
brt_maybe_exists() might already return true since something was
cloned, but bv_mos_entries is still 0 since BRT ZAP was not yet
created. In such case we should not try to look into the ZAP
and dereference NULL bv_mos_entries_dnode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16851
cstyle can handle these cases now, so we don't need to disable it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16840
- Instead of copying one ashift-sized block per ZIO, copy as much
as we have contiguous data up to 16MB per old vdev. To avoid data
moves use gang ABDs, so that read ZIOs can directly fill buffers
for write ZIOs. ABDs have much smaller overhead than ZIOs in both
memory usage and processing time, plus big I/Os do not depend on
I/O aggregation and scheduling to reach decent performance on HDDs.
- Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms.
- Use 32bit range tree when possible (practically always now) to
slightly reduce memory usage.
- Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as
for main ones.
- Fix rate overflows in `zpool status` reporting.
With these changes expanding RAIDZ1 from 4 to 5 children I am able
to reach 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both are
limited by devices instead of CPU.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15680Closes#16819
Same as writes block cloning can increase block size and number of
indirection levels. That means it can dirty block 0 at level 0 or
at new top indirection level without explicitly holding them.
A block cloning test case for large offsets has been added.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16825
This also includes removing L2 vdevs asynchronously.
This commit also guarantees that spa_load_guid is unique.
The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.
However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools. Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16215
- Issue prescient prefetches for demand indirect blocks after the
first one. It should be quite rare for reads/writes, but much more
useful for cloning due to much bigger (up to 1022 blocks) accesses.
It covers the gap during the first couple accesses when we can not
speculate yet, but we know what is needed right now. It reduces
dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode().
- Increase maximum prefetch distance for indirect blocks from 64
to 128MB. It should cover the maximum 1022 blocks of block cloning
access size in case of default 128KB recordsize used. In case of
bigger recordsize the above prescient prefetch should also help.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16814