Since spa_dspace accounts only for normal allocation class space,
spa_nonallocating_dspace should do the same. Otherwise we may get a
negative overflow or the corresponding assertion in spa_update_dspace()
if a removed special/dedup vdev is bigger than all of the normal class space.
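A rough sketch of the idea (not the actual patch; where the accounting happens and the exact class check are assumptions):
```
/*
 * Hypothetical sketch: when a vdev stops allocating, count it against
 * spa_nonallocating_dspace only if it belongs to the normal allocation
 * class, matching what spa_dspace accounts for.
 */
if (vd->vdev_mg != NULL &&
    vd->vdev_mg->mg_class == spa_normal_class(spa)) {
	spa->spa_nonallocating_dspace += spa_deflate(spa) ?
	    vd->vdev_stat.vs_dspace : vd->vdev_stat.vs_space;
}
```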
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17183
Previous code allowed each metaslab group to have a different number
of allocators. But in practice it worked only for embedded SLOGs,
relying on a number of conditions and creating a significant minefield
if any of those changed. I just stepped on one myself.
This change makes all groups have spa_alloc_count allocators.
It may cost an extra 192 bytes of memory per normal top-level vdev
on large systems, but I find that a small price for cleaner and more
reliable code.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes#17188
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17187
When handling block pointer remap after device removal, skip blocks
that might be cloned. BRTs are indexed by vdev id and the offset from
the block pointer's DVA[0]. So if we start addressing the same block by
some different DVA, we won't get the proper reference counter. As a
result, we might remap the block twice, which may trigger an assertion
during indirect mapping condense; free it prematurely, which may result
in data overwrite; or free it twice, which may trigger an assertion in
the spacemap code.
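A minimal sketch of the check described above (the surrounding remap loop is omitted):
```
/*
 * Sketch: during device-removal remap, leave a block pointer alone if
 * it might be cloned, since BRT references are keyed by the original
 * DVA[0] vdev/offset and would not follow a remapped address.
 */
if (brt_maybe_exists(spa, bp))
	return;	/* keep the original DVA so BRT refcounts stay consistent */
```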
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15604
Closes#17180
Existing allocation throttling had a goal of improving write speed
by allocating more data to vdevs that are able to write it faster.
But in the process it completely broke the original mechanism,
designed to balance vdev space usage. With severe vdev space use
imbalance it is possible that vdevs with higher use start accumulating
fragmentation sooner than others and, after getting full, stop
accepting writes at all. Also, after a vdev addition it might take a
very long time for the pool to restore the balance, since the new vdev
gets no real preference unless the old ones are already much slower
due to fragmentation. The old throttling was also request-based, which
was unpredictable with block sizes varying from 512B to 16MB, and made
little sense with I/O aggregation, where its 32-100 requests could be
aggregated into a few, leaving the device underutilized with fewer
and/or shorter requests, or, conversely, could try to queue up to
1.6GB of writes per device.
This change presents a completely new throttling algorithm. Unlike
the request-based old one, this one measures the allocation queue in
bytes. That makes it possible to integrate with the reworked allocation
quota (aliquot) mechanism, which is also byte-based. Unlike the
original code, which balanced the vdevs' amounts of free space, this
one balances their free/used space fractions. It should result in
lower and more uniform fragmentation in the long run.
This algorithm still allows write speed to be improved by allocating
more data to faster vdevs, but does so in a more controllable way.
On top of the space-based allocation quota, it also calculates the
minimum queue depth a vdev is allowed to maintain, and respectively the
amount of extra allocations it can receive if it appears faster.
That amount is based on the vdev's capacity and space usage, but is
applied only when the pool is busy. This way the code can choose
between faster writes when needed and better vdev balance when not,
with the preference gradually diminishing as free space shrinks.
This change also makes allocation queues per-class, allowing them
to throttle independently and in parallel. Allocations that are
bounced between classes due to allocation errors will be able to
properly throttle in the new class. Allocations that should not
be throttled (ZIL, gang, copies) are not, but may still follow
the rotor and allocation quota mechanism of the class without
disrupting it.
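An illustrative sketch of the byte-based quota idea; the function and variable names below are hypothetical, not the actual implementation:
```
/*
 * Hypothetical sketch: give each vdev an allocation quota proportional
 * to its free-space fraction, and grant a speed bonus only while the
 * pool is busy, so balance wins back as free space shrinks.
 */
static uint64_t
vdev_alloc_quota(vdev_t *vd, uint64_t aliquot, boolean_t busy)
{
	uint64_t size = vd->vdev_stat.vs_space;
	uint64_t free = size - vd->vdev_stat.vs_alloc;
	uint64_t quota = aliquot * free / MAX(size, 1);

	if (busy)	/* extra queue depth for faster vdevs when throughput matters */
		quota += quota * free / MAX(size, 1);
	return (quota);
}
```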
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in ZFS, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
set it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.
We add logic to calculate the number of gang header copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.
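A hedged sketch of the copies calculation; `zp`, `os` and `level` are assumed to be in scope, and the exact call site is simplified:
```
/*
 * Sketch: pick the number of gang header copies up front from the IO's
 * own copies and redundant_metadata, instead of unconditionally
 * bumping copies to at least 2 at gang time.
 */
uint_t gang_copies = zp->zp_copies;
if (os->os_redundant_metadata == ZFS_REDUNDANT_METADATA_ALL ||
    (os->os_redundant_metadata == ZFS_REDUNDANT_METADATA_MOST &&
    level >= zfs_redundant_metadata_most_ditto_level))
	gang_copies = MIN(gang_copies + 1, SPA_DVAS_PER_BP);
```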
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
spa_vdev_remove_thread() should not hold svr_lock while loading a
metaslab. Doing so may block ZIO threads that are required to handle
the metaslab load, at least in the case of read errors causing
recovery writes.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17145
The vd->vdev_ms access can overflow due to on-disk corruption, not just
due to programming bugs. So it makes sense to check its boundaries even
in production builds.
Sponsored by: ConnectWise
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#17136
- Don't drop the L2ARC header if there are more buffers in this header.
Since we keep the header for them, keep the L2ARC header as well.
Honestly, we are not required to drop it even if there are no other
buffers, but then we'd need to allocate a separate header, which
we might drop soon if the old block is really deleted. Multiple
buffers in a header likely mean active snapshots or dedup, so we
know that the block in L2ARC will remain valid. It might be rare,
but why not?
- Remove some impossible assertions and conditions.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17126
This helps to avoid confusion with the similarly-named
txg_wait_synced().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see the zfs-receive
man page). This means a user with the receive permission can effectively
delete data on the receiving side, even if that user does not have
explicit rollback or destroy permissions.
This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.
Fixes#16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#17015
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.
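Illustrative only; the names below are hypothetical placeholders, just to show how a backlog term could enter the per-txg flush amount:
```
/*
 * Hypothetical sketch: flush at least enough entries per txg to keep
 * up with new log entries, plus a share of the accumulated backlog so
 * it drains even on an idle system.
 */
uint64_t flush_entries = MAX(avg_log_ingest_per_txg,
    log_backlog_entries / target_txgs_to_drain);
```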
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Don Brady <don.brady@klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17038
The implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup(), so that a previously pruned
incarnation of the entry is not freed by mistake. But when writing a
new block in zio_ddt_write() we might have DVAs only from the override
pointer, which can never have the "D" flag and so cannot be confused
with a pruned DDT entry, yet we would abandon those DVAs if we found a
matching entry in the DDT.
This fixes deduplication for blocks written via dmu_sync() for the
purposes of indirect ZIL write records, which I have tested. I suspect
it might also allow deduplication for Direct I/O, even if in an odd
way -- first write the block directly, then free it later during TXG
commit if a duplicate is found -- which part I haven't tested.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17120
Since the introduction of embedded blocks 11 years ago, writing them
has been blocked if dedup is enabled. After searching through the
modern code I see no reason for this restriction to exist. At the same
time, embedded blocks are dramatically cheaper. Even a regular write of
such small blocks would likely be cheaper than deduplication, even when
the latter succeeds, not to mention when it doesn't.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17113
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.
We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.
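A sketch of how the result might be filled in, assuming the Linux kstat fields for STATX_DIOALIGN; the surrounding getattr plumbing is omitted:
```
#ifdef STATX_DIOALIGN
	/* Memory alignment is a page; IO alignment is the file's block size. */
	if (request_mask & STATX_DIOALIGN) {
		stat->dio_mem_align = PAGE_SIZE;
		stat->dio_offset_align = zp->z_blksz;
		stat->result_mask |= STATX_DIOALIGN;
	}
#endif
```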
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16972
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:
```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```
This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.
And in zdb, don't crash when printing a block pointer with no valid
DVAs. If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug. zdb should be able to handle the
situation gracefully.
Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR. This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why. It doesn't appear to do anything useful, so remove it.
[^1]: b24ab67627
Fixes #17077
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#17078
PR #14161 made spa_do_crypt_objset_mac_abd() ignore MAC errors
if the local MAC cannot be calculated at the time. But that does not
mean we should also ignore portable MAC errors there.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17122
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock as writer, then a hash lock), unlike in
arc_read(), during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.
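A rough sketch of the retry pattern (paraphrased, not the literal arc_read() code):
```
/*
 * Never wait for the config lock while a hash lock is held: back off,
 * wait for the config lock to become free, and restart the lookup.
 */
if (!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER)) {
	mutex_exit(hash_lock);
	spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
	spa_config_exit(spa, SCL_L2ARC, vd);
	goto top;	/* retry from the beginning */
}
```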
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17071
Don't try to get mg of hole vdev in removal
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17080
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes#17088
Before this change the zfs_metaslab_switch_threshold tunable switched
metaslabs each time a metaslab's index dropped by two (which means the
biggest contiguous chunk shrank to 1/4). It is a good idea to balance
metaslab fragmentation. But for empty metaslabs (having power-of-2
sizes) this means switching when they get just below half of their
capacity. Inspection with zdb after filling a new pool to half capacity
showed most of its metaslabs filled to half capacity. I consider this
sub-optimal for pool fragmentation in the long run.
This change blocks the metaslab switching if most of the metaslab's
free space (15/16) is represented by a single contiguous range.
Such a metaslab should not be considered fragmented until it actually
fails some big allocation. More contiguous filling should improve
data locality and increase the time before a previously filled and
partially freed metaslab is touched again, giving it more time to
free more contiguous chunks for lower fragmentation. It should
also slightly reduce spacemap traffic.
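A sketch of the new guard with approximate names; NULL checks and the surrounding weight/switch logic are omitted:
```
/*
 * Don't switch away from a metaslab while a single contiguous range
 * still covers most (15/16) of its free space.
 */
uint64_t free = msp->ms_size - metaslab_allocated_space(msp);
if (metaslab_largest_allocatable(msp) >= free - (free >> 4))
	return (B_FALSE);	/* not really fragmented yet */
```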
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17081
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the
zio_taskq thread is active and blocked both while waiting for the IO
call and then while calling zio_execute() for the next stage. This is a
particular issue for FLUSH, as the z_flush_iss queue typically only has
one thread; multiple flushes arriving at once can cause long delays if
the underlying fsync() response is particularly slow.
To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq,
just as we do for reads and writes. Further, we return all results
through zio_interrupt(), so neither the issue taskq nor the file taskq
is blocked.
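A sketch of the dispatch (simplified from what vdev_file_io_start() might look like after the change; the TRIM case is analogous and error handling is omitted):
```
	case ZIO_TYPE_FLUSH:
		/*
		 * Hand the blocking fsync to the file taskq; the worker ends
		 * with zio_interrupt(zio), so no pipeline thread sits blocked.
		 */
		VERIFY3U(taskq_dispatch(vdev_file_taskq, vdev_file_io_fsync,
		    zio, TQ_SLEEP), !=, TASKQID_INVALID);
		return;
```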
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17064
We need to use arc_free_data_abd() to free an ABD-type buffer.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#17079
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.
The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Since we are calculating free space fragmentation, we should
weight metaslabs by the amount of their free space, not by their full
size. Fragmentation of full metaslabs may not matter in the presence
of empty ones. The old algorithm did not differentiate metaslabs
having only one free 4KB block from metaslabs having 50% of their
space free in 4KB blocks, reporting higher fragmentation.
While there, move the metaslab_group_alloc_update() call to after
setting mg_fragmentation, otherwise the effect may be delayed by one TXG.
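A sketch of the weighting change, with approximate names:
```
	/* Inside the per-metaslab loop: */
	uint64_t free = msp->ms_size - metaslab_allocated_space(msp);
	frag_sum += msp->ms_fragmentation * free;
	free_sum += free;

	/* After the loop: weight by free space, not by full size. */
	mg->mg_fragmentation = (free_sum > 0) ? frag_sum / free_sum : 0;
```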
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
The purpose of a no-op injection is to simulate a failure between a
device cache and its permanent store. We still want it to go through
the queue and respond in the same way as everything else.
So, inject "success" as the very last thing, and then move on to
VDEV_IO_DONE so it can be dequeued and any followup work can occur.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17029
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17003
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#16986
As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.
It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.
To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.
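A sketch of the locking added, close in spirit but not verbatim:
```
	/*
	 * Hold the reexecuted zio's io_lock while recounting its children,
	 * so a child racing through zio_notify_parent() cannot decrement a
	 * wait count that has not been fully re-established yet.
	 */
	zio_link_t *zl = NULL;
	mutex_enter(&pio->io_lock);
	for (zio_t *cio = zio_walk_children(pio, &zl); cio != NULL;
	    cio = zio_walk_children(pio, &zl)) {
		for (int w = 0; w < ZIO_WAIT_TYPES; w++)
			pio->io_children[cio->io_child_type][w]++;
	}
	mutex_exit(&pio->io_lock);
```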
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17016
This change prevents prefetch from performing an unnecessary ARC buffer
fill when reading from disk.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#17013
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which are explicitly
excluded from injection. Even if it were possible, it would be awkward
to do, because a probe is a sequence of reads and writes.
This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will match the
injection record and receive the wanted error.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
I'm about to add a new "type", and I need somewhere to put it!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.
So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (ie after frequency and other exclusions are applied).
Then, display those counts in the `zinject` output.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#16938
Similar to what we saw in #16569, a replacing vdev should not be
counted as fully contributing to the redundancy of a raidz vdev even
though the current IO has enough redundancy.
When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16864
It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the removed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.
To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac223.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16921
In Linux, block devices currently lack support for `copy_file_range`
API because the kernel does not provide the necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16901
People have noted there's no way to remove a pool userprop, only zero
it. It turns out vdev userprops had a method: setting an empty string.
So this makes pool userprops follow the same behaviour.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16887
The count of chunks in a microzap block is stored as a uint16_t
(mze_chunkid). Each chunk is 64 bytes, and the first is used to store a
header, so there are 32767 usable chunks, which is just under 2M. 1M is
the largest power-2-rounded block size under 2M, so we must set the
limit there.
If it goes higher, the loop in mzap_addent can overflow and fall into
the PANIC case.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16888
VDEV_PROP_USERPROP is equal to VDEV_PROP_INVAL and so is not a real
property. That's why vdev_prop_readonly() does not work right for
it. In particular, it may declare all vdev user properties read-only
on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16890
We should not dereference rra after the last zio_nowait() is called.
It seems very unlikely, but ASAN in ztest managed to catch it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16868
It seems there's no good reason for vdev_disk & vdev_geom to explicitly
detect no support for flush and set vdev_nowritecache. Instead, just
signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess()
take care of it in one place.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
The first time a device returns ENOTSUP in response to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend they succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!
So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.
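A sketch of the idea, roughly where zio_vdev_io_assess() handles the flush result:
```
	if (zio->io_type == ZIO_TYPE_FLUSH && zio->io_error == ENOTSUP) {
		/*
		 * The device told us it doesn't do flushes; remember that,
		 * and report this first flush as a success too.
		 */
		vd->vdev_nowritecache = B_TRUE;
		zio->io_error = 0;
	}
```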
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db,
except in the overridden case. However, this exception causes a race
where dbuf_new_size could free the arc_buf after the last dereference
of *datap, causing a use-after-free. We fix this by cloning the buf
regardless of whether it's overridden.
The race:
```
P0                                    P1

dbuf_hold_impl()
  // dbuf_hold_copy passed
  // because db_data_pending NULL
                                      dbuf_sync_leaf()
                                        // doesn't clone *datap
                                        // *datap derefed to db_buf
                                        dbuf_write(*datap)
dbuf_new_size()
  dmu_buf_will_dirty()
    dbuf_fix_old_data()
      // alloc new buf for P0 dr
      // but can't change *datap
      arc_alloc_buf()
      arc_buf_destroy()
        // alloc new buf for db_buf
        // and destroy old buf
                                      dbuf_write() // continue
                                        abd_get_from_buf(data->b_data,
                                            arc_buf_size(data))
                                        // use-after-free
```
Here's an example when it happens:
```
BUG: kernel NULL pointer dereference, address: 000000000000002e
RIP: 0010:arc_buf_size+0x1c/0x30 [zfs]
Call Trace:
  dbuf_write+0x3ff/0x580 [zfs]
  dbuf_sync_leaf+0x13c/0x530 [zfs]
  dbuf_sync_list+0xbf/0x120 [zfs]
  dnode_sync+0x3ea/0x7a0 [zfs]
  sync_dnodes_task+0x71/0xa0 [zfs]
  taskq_thread+0x2b8/0x4e0 [spl]
  kthread+0x112/0x130
  ret_from_fork+0x1f/0x30
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes#16854
When a vdev first sees some block cloning, there is a window when
brt_maybe_exists() might already return true since something was
cloned, but bv_mos_entries is still 0 since the BRT ZAP was not yet
created. In such a case we should not try to look into the ZAP
and dereference the NULL bv_mos_entries_dnode.
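A minimal sketch of the guard (real field names, simplified context):
```
	/* No BRT ZAP exists for this vdev yet, so nothing can be in it. */
	if (brtvd->bv_mos_entries == 0)
		return (B_FALSE);
```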
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16851
cstyle can handle these cases now, so we don't need to disable it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16840
- Instead of copying one ashift-sized block per ZIO, copy as much
as we have contiguous data up to 16MB per old vdev. To avoid data
moves use gang ABDs, so that read ZIOs can directly fill buffers
for write ZIOs. ABDs have much smaller overhead than ZIOs in both
memory usage and processing time, plus big I/Os do not depend on
I/O aggregation and scheduling to reach decent performance on HDDs.
- Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms.
- Use 32bit range tree when possible (practically always now) to
slightly reduce memory usage.
- Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as
for main ones.
- Fix rate overflows in `zpool status` reporting.
With these changes, expanding RAIDZ1 from 4 to 5 children I am able
to reach a 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both limited
by the devices instead of the CPU.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15680
Closes#16819
Same as writes, block cloning can increase the block size and the
number of indirection levels. That means it can dirty block 0 at level
0 or at a new top indirection level without explicitly holding them.
A block cloning test case for large offsets has been added.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16825
This also includes removing L2 vdevs asynchronously.
This commit also guarantees that spa_load_guid is unique.
The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.
However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools. Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.
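A sketch of the uniqueness guarantee; spa_generate_load_guid() is the real seeding function, while the in-use check is a hypothetical helper standing in for the actual lookup:
```
	uint64_t load_guid;

	do {
		load_guid = spa_generate_load_guid();
	} while (spa_load_guid_in_use(load_guid));	/* hypothetical check */
	spa->spa_load_guid = load_guid;
```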
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16215
- Issue prescient prefetches for demand indirect blocks after the
first one. It should be quite rare for reads/writes, but much more
useful for cloning due to much bigger (up to 1022 blocks) accesses.
It covers the gap during the first couple accesses when we can not
speculate yet, but we know what is needed right now. It reduces
dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode().
- Increase maximum prefetch distance for indirect blocks from 64
to 128MB. It should cover the maximum 1022 blocks of block cloning
access size in case of default 128KB recordsize used. In case of
bigger recordsize the above prescient prefetch should also help.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16814
In some cases like dsl_dataset_hold_obj() it is possible to handle
those errors, so a failure to hold the dataset is better than a
kernel panic. Some other places where these errors are still not
handled but asserted should be less dangerous, being effectively
unreachable.
We have a user report about pool corruption leading to assertions
on these errors. Hopefully this will make the behavior a bit nicer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16836
Some users might want to scrub only new data because they would like
to know whether recent writes were corrupted. This PR adds the
possibility to scrub only newly written data.
It introduces a new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run scrub only
from the last saved point. We use scn_max_txg and scn_min_txg,
which are already built into scrub, to accomplish that.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes#16301
The Direct I/O implementation added a condition to call dbuf_undirty()
only in case of block cloning. But the condition is not right if
the block is no longer dirty in this TXG, but still in DB_NOFILL
state. It resulted in the block not reverting to DB_UNCACHED and a
following NULL dereference on an attempt to access the absent db_data.
While there, add assertions for db_data to make debugging easier.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16829
This allowed debugging #16714, fixed in #16782. Without the assertions
added here it is difficult to figure out which logs cause the problem,
since the assertion happens in sync thread context.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#16795
This fixes an assertion in brt_sync_table() on debug builds when the
last cloned block on the vdev is freed and bv_meta_dirty is cleared,
while bv_entcount_dirty is not. It should not matter in production.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16791
The previous implementation of zap_leaf_array_free() put chunks on the
free list in reverse order. zap_leaf_transfer_entry() and
zap_entry_remove() were also freeing name and value arrays in reverse
order. Together this created a mess in the free list, making
subsequent allocations much more fragmented than necessary.
This patch re-implements zap_leaf_array_free() to keep the existing
chunk order, and implements a non-destructive zap_leaf_array_copy()
to be used in zap_leaf_transfer_entry(), allowing properly ordered
freeing of name and value arrays there and in zap_entry_remove().
With this change a test of some writes and deletes shows the percentage
of non-contiguous chunks in the DDT reducing from 61% and 47% to 0% and
17% for arrays and frees respectively. Some explicit sorting could
surely do even better, especially for ZAPs with variable-size arrays,
but it would also cost much more, while this should be very cheap.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16766
zio_delay_interrupt(), apparently used for fault injection, is executed
in the I/O pipeline. It can cause the calling thread to go to sleep,
which is not allowed on FreeBSD. This happens only for small delays,
though, and there's no apparent reason to avoid deferring to a taskqueue
in that case, as it already does otherwise.
Simply defer to a taskqueue unconditionally. This fixes an occasional panic I
see when running the ZTS on FreeBSD. Also remove an unhelpful comment
referencing the non-existent timeout_generic().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#16785
- With both pending and current AVL-trees being per-vdev and having
effectively identical comparison functions (the pending tree also
compared birth time, but I don't believe it is possible for them to be
different for the same offset within one transaction group), it
makes no sense to move entries from one to another. Instead, inline a
dramatically simplified brt_entry_addref() into brt_pending_apply().
It no longer requires bv_lock, since there is nothing concurrent
with it at that time. And it does not need to search the tree for
previous entries, since it is the same tree: we already have the
entry and we know it is unique.
- Put brt_vdev_lookup() and brt_vdev_addref() into different tree
traversals to avoid false positives in the first due to the second's
entcount modifications. It saves a dramatic amount of time when a
file is cloned for the first time by not looking for non-existent ZAP
entries.
- Remove the avl_is_empty(bv_tree) check from brt_maybe_exists(). I
don't think it is needed, since by that time all added entries are
already accounted for in bv_entcount. The extra check must be producing
too many false positives for no reason. Also we don't need bv_lock
there, since the bv_entcount pointer must be stable at this point, and
we don't care about false-positive races here, while a false negative
should be impossible, since all brt_vdev_addref() calls have already
completed by this point. This dramatically reduces lock contention
on massive deletes of cloned blocks. The only remaining contention is
between multiple parallel free threads calling brt_entry_decref().
- Do not update the ZAP if the net change for a block over the TXG was
0. In combination with the above, this makes moving a file between
datasets as cheap an operation as originally intended if it fits into
one TXG.
- Do not allocate per-vdev structures on pool creation or import if the
pool did not have active block cloning. This allows saving a bit in a
few cases.
- While here, add proper error handling in brt_load() on pool
import instead of assertions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16773
Without doing that there is a race window on export when a history
log write by a completed rebuild dirties a transaction beyond the
final one, triggering an assertion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16714
Closes#16782
Those values require global atomics to get the current hash_elements
values in a few of the hottest code paths, while in all these years I
never cared about them. If somebody wants, it should be easy to
get them by periodic sampling, since neither ARC header nor dbuf
counts change so fast that it would be difficult to catch.
For now I've left the hash_elements_max kstat for ARC, since it was
used/reported by arc_summary and removing it would break older
versions, but now it just reports the current value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16759
Compression names actually aren't used in dedup table names, but
checksum names are.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16776
While the block cloning operation was made per-vdev from the beginning,
before this change most of its data were protected by two pool-
wide locks. That created lots of lock contention in many workloads.
This change makes most of the block cloning data structures per-vdev,
which allows locking them separately. The only pool-wide lock now
is spa_brt_lock, protecting the array of per-vdev pointers and in most
cases taken as reader. Also this splits the per-vdev locks into three
different ones: bv_pending_lock protects the AVL-tree of pending
operations in open context, bv_mos_entries_lock protects the BRT ZAP
object while it is being prefetched, and bv_lock protects the rest
of the per-vdev context during the TXG commit process. There should be
no functional difference aside from some optimizations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16740
We are doing exactly the same checks around all brt_pending_add().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16740
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.
Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.
If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times on a single
transaction (eg a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At the time of writing, a zio_t is 1280 bytes, so just one 2T
object requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.
This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.
For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.
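A sketch of the deferral, simplified from where the free is issued:
```
	/*
	 * If freeing this block would require a FREE zio (gang, dedup, or
	 * possibly cloned), put it on the pool's freelist instead and let
	 * the async destroy machinery trickle it out.
	 */
	if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) || brt_maybe_exists(spa, bp)) {
		bplist_append(&spa->spa_free_bplist[txg & TXG_MASK], bp);
		return;
	}
	zio_free(spa, txg, bp);
```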
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#6783
Closes#16708
Closes#16722
Closes#16697
..., before we make the header or the log block visible to others.
It should fix an assertion on allocated space going negative if the
header is freed once the lock is dropped, while the write is still
in progress.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16040
Closes#16743
Currently, even though send_reader_thread prefetches the spill block,
do_dump() will not use it and issues its own blocking arc_read. This
causes significant performance degradation when sending datasets with
lots of spill blocks.
For unmodified spill blocks, we now also create a send_range struct for
them in send_reader_thread and issue prefetches for them. We piggyback
them on the dnode send_range instead of enqueueing them so we don't
break the send_range_after check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: david.chen <david.chen@nutanix.com>
Closes#16701
Small block workloads may use a very large number of dirty records.
During a simple block cloning test, due to BRT still using 4KB blocks,
I can easily see up to 2.5M of them in use. Before this change the
dbuf_dirty_record_t structures representing them were allocated via
kmem_zalloc(), which rounded their size up to 512 bytes.
Introducing a specialized kmem cache allows reducing the size
from 512 to 408 bytes. Additionally, since the override and raw params
in dirty records are mutually exclusive, putting them into a union
reduces the structure size down to 368 bytes, increasing the
saving to 28%, which can be 0.5GB or more of RAM.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16694
I think we've done enough experiments.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16189
Closes#16712
Now that we can handle these different alignments, we don't need this
workaround.
This reverts commit aefc2da8a5.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16687
While reading some code @grwilson came across the above function that
seemingly had no consumers besides a ztest callback that ensures that
the tx_callback infrastructure works correctly. It turns out that Lustre
is the main (and potentially the only) consumer of this. Refer to
`osd_trans_commit_cb` of `lustre/osd-zfs/osd_handler.c` in the Lustre
repo for more info. Let's add a comment highlighting this before someone
removes it by mistake.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Closes#16698
If on the first open the device's logical ashift is bigger than the one
set by the pool's ashift property, ignore the latter as unusable
instead of creating a vdev that will fail most I/Os due to
misalignment.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16690
in some cases not linearizing buffers with a disk sector crossing a
page boundary. It is fine for hardware, but somehow required by LUKS.
It is not typical for ZFS to produce such buffers, but it may happen
if a 6KB block is compressed to 4KB, while still having 2KB alignment.
Banning the 6KB buffers helps vdevs with ashift=12.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
This partially reverts commit 41210597. Now that b4e4cbeb2 has
been merged Direct IO can be enabled by default for Linux, but
for FreeBSD there still remains a potentially insufficient range
locking in zfs_getpages() which needs to be resolved.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#16629
This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead to checksum
verify failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.
To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, the I/O request will be
reissued through the ARC. This allows for actual validation to happen
and removes any possibility of the buffer being manipulated after the
I/O has been issued.
Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported through zpool status -d
in the DIO column. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.
Even though FreeBSD can write protect pages (stable pages) it still has
the same issue as Linux with Direct I/O reads.
This commit updates the following:
1. Propagates checksum failures for reads all the way up to the
top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a
Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new
code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that
Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for
Direct I/O reads.
This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#16598
`zvol_rename_minors()` needs to be given the full path not just the
snapshot name. Use code removed in a0bd735ad as a guide
to providing the necessary values.
Add ZTS check for /dev changes after snapshot rename. After
renaming a snapshot with 'snapdev=visible' ensure that the /dev
entries are updated to reflect the rename.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Dingwall <james@dingwall.me.uk>
Closes#14223
Closes#16600
Since an arc_evict() run can take some time, an arc_c change during it
may result in an undesired shift in the ARC states balance. Primarily,
in case of arc_c reduction it may cause eviction from the MFU data
state despite it already being below the target. Instead we should
evict as originally planned and, if needed, do another round after.
Reviewed-by: Theera K. <tkittich@hotmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16576
Closes#16605
Restart a resilver from scratch, if the current one in progress is
below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).
The original rationale for deferring additional resilvers, when there is
already one in progress, was to help achieving data redundancy sooner
for the data that gets scanned at the end of the resilver.
But in case the admin wants to attach multiple disks to a single vdev,
it wasn't immediately obvious the admin is supposed to run
`zpool resilver` afterwards to reset the deferred resilvers and start
a new one from scratch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#15810
In a4b21eadec we added the zap_micro_max_size tuneable to raise the size
at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block)
ZAPs. Before this, a microZAP was limited to 128KiB, which was the old
largest block size. The side effect of raising the max size past 128KiB
is that it will be stored in a large block, requiring the large_blocks
feature.
Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.
This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the
first time that occurs. This feature is later checked when generating
the stream and if active, the send operation will abort unless
--large-block has also been requested.
Changing the limit still requires zap_micro_max_size to be changed. The
state of this flag effectively sets the upper value for this tuneable,
that is, if the feature is disabled, the tuneable will be clamped to
128KiB.
A stream flag is also added to ensure that the receiver also activates
its own feature flag upon receiving the stream. This is not strictly
necessary to _use_ the received microZAP, since it doesn't care how
large its block is, but it is required to send the microZAP object on,
otherwise the original problem occurs again.
Because it's difficult to reliably distinguish a microZAP from a fatZAP
from outside the ZAP code, and because it seems unlikely that most
users are affected (a fairly niche tuneable combined with what should be
an uncommon use of send), and for the sake of expediency, this change
activates the feature the first time a microZAP grows to use a large
block, and is never deactivated after that. This can be improved in the
future.
This commit changes nothing for existing pools that already have large
microZAPs. The feature will not be retroactively applied, but will be
activated the next time a microZAP grows past the limit.
Don't use large_blocks feature for enable/disable tests. The
large_microzap depends on large_blocks, so it gets enabled as a
dependency, breaking the test. Instead use feature "longname", which has
the exact same feature characteristics.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16593
While some remaining issues with the recently merged Direct IO
functionality are resolved, disable it by default.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#16597
In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents
Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.
The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.
I think that technically the zfs_dirlook change is enough to prevent
access, but preventing lookups and dir entries in an already opened .zfs
handle might also be a good idea to prevent races when modifying the
property at runtime.
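As a rough sketch of the behaviour described above (the constant name and
helper are hypothetical, not the exact zfs_dirlook() change):

    #include <errno.h>
    #include <string.h>

    #define SNAPDIR_DISABLED 2      /* hypothetical value of the new setting */

    /* Sketch: with snapdir=disabled, lookups of the ".zfs" control dir
     * fail with ENOENT instead of merely being hidden from listings. */
    static int
    snapdir_lookup_check(const char *name, int snapdir_prop)
    {
        if (snapdir_prop == SNAPDIR_DISABLED && strcmp(name, ".zfs") == 0)
            return (ENOENT);
        return (0);                 /* continue with the normal lookup */
    }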
Add zfs_snapshot_no_setuid parameter to control whether automatically
mounted snapshots have the setuid mount option set or not.
This could be considered a partial fix for one of the scenarios
mentioned above.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes#3963
Closes#16587
This patch adds the ability for zfs to support file/dir names up to 1023
bytes. This number is chosen so we can support up to 255 4-byte
characters. This new feature is represented by the new feature flag
feature@longname.
A new dataset property "longname" is also introduced to toggle longname
support for each dataset individually. This property can be disabled
even if the dataset contains longname files. In that case, new files
cannot be created with long names, but existing longname files can still
be looked up.
Note that, to my knowledge, native Linux filesystems don't support
names longer than 255 bytes, so there might be programs unable to work
with long names.
Note that the NFS server may need to use exportfs_get_name to reconnect
dentries, and the buffer being passed is limited to NAME_MAX+1 (256)
bytes, so NFS may not work when longname is enabled.
Note that the FreeBSD VFS layer imposes a 255-byte limit on name
length, so even though we add code to support longer names here, they
won't actually work there.
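A minimal sketch of the creation-time check implied above (the limits
come from this commit text; constant names are illustrative):

    #include <errno.h>
    #include <string.h>

    #define LONGNAME_MAX  1023      /* bytes, with feature@longname */
    #define SHORTNAME_MAX 255       /* historical limit */

    /* Sketch: new names are checked against the active limit; with the
     * longname property disabled, creation falls back to 255 bytes even
     * though existing long names remain resolvable. */
    static int
    zfs_check_name_len(const char *name, int longname_enabled)
    {
        size_t max = longname_enabled ? LONGNAME_MAX : SHORTNAME_MAX;
        return (strlen(name) > max ? ENAMETOOLONG : 0);
    }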
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#15921
This patch is preparatory work for the long name feature. It changes
all users of zap_attribute_t to allocate it from kmem instead of the
stack. It also makes the zap_attribute_t and zap_name_t structures
variable length.
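In broad strokes, and only as a sketch (field names and the allocator
are stand-ins, not the exact zap.h definitions), the struct becomes
variable length and is allocated from the heap rather than declared on
the stack:

    #include <stdlib.h>

    /* Stand-in for a variable-length zap_attribute_t: fixed fields plus
     * a name buffer sized at allocation time (was a fixed-size array). */
    typedef struct {
        int    za_integer_length;
        size_t za_name_size;
        char   za_name[];           /* flexible array member */
    } za_sketch_t;

    static za_sketch_t *
    za_alloc(size_t name_size)      /* kmem_alloc() in-kernel */
    {
        za_sketch_t *za = calloc(1, sizeof (*za) + name_size);
        if (za != NULL)
            za->za_name_size = name_size;
        return (za);
    }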
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#15921
Specifically, a child in a replacing vdev won't count when assessing
the DTL during a vdev_fault().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16569
ZIL log record structs (lr_XX_t) are frequently allocated with extra
space after the struct to carry variable-sized "payload" items.
Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
bounds checking on memcpy() calls. Because these types had no indicator
that they might use more space than their simple definition,
__fortify_memcpy_chk will frequently complain about overruns, e.g.:
memcpy: detected field-spanning write (size 7) of single field
"lr + 1" at zfs_log.c:425 (size 0)
memcpy: detected field-spanning write (size 9) of single field
"(char *)(lr + 1)" at zfs_log.c:593 (size 0)
memcpy: detected field-spanning write (size 4) of single field
"(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
To fix this, this commit adds flex array fields to all lr_XX_t structs
that require them, and then uses those fields to access that
end-of-struct area rather than more complicated casts and pointer
addition.
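A self-contained sketch of the pattern (struct layout simplified, not
the real lr_XX_t definitions): the flexible array member tells the
compiler, and therefore FORTIFY_SOURCE, that data past the fixed fields
is intentional.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simplified log record: fixed header plus variable-sized payload. */
    typedef struct {
        uint64_t lrc_txtype;
        uint64_t lrc_reclen;
        uint8_t  lr_data[];         /* payload follows the fixed fields */
    } lr_sketch_t;

    static lr_sketch_t *
    lr_alloc(const void *payload, size_t len)
    {
        lr_sketch_t *lr = malloc(sizeof (*lr) + len);
        if (lr == NULL)
            return (NULL);
        lr->lrc_reclen = sizeof (*lr) + len;
        /* Previously written as memcpy(lr + 1, payload, len), which the
         * runtime bounds check flags as a field-spanning write. */
        memcpy(lr->lr_data, payload, len);
        return (lr);
    }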
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16501
Closes#16539
Since dsl_crypto_key_open() references the key, 0d23f5e2e4 should
have called dsl_crypto_key_rele() to drop it first instead of
calling dsl_crypto_key_free() directly. The final result should
actually be the same, but without triggering the dck_holds assertion.
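The hold/release discipline at issue, reduced to a generic, runnable
sketch (not the dsl_crypto_key code itself):

    #include <assert.h>
    #include <stdlib.h>

    typedef struct {
        int holds;
    } key_sketch_t;

    /* Sketch: an open that takes a hold is paired with a release, and the
     * object is freed by the last release rather than by freeing it
     * directly while holds are still outstanding. */
    static void
    key_rele(key_sketch_t *k)
    {
        assert(k->holds > 0);
        if (--k->holds == 0)
            free(k);
    }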
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16567
Without updating 'm' we evict from MFU metadata all that we wanted
to evict from all metadata, including already evicted MRU metadata
('m' is the total amount of metadata we had at the beginning,
and 'w' is the total amount of metadata we want to have).
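A sketch of the accounting in question (not the actual arc_evict()
code; 'm' and 'w' follow the description above):

    #include <stdint.h>

    #define MIN_U64(a, b) ((a) < (b) ? (a) : (b))

    /* Sketch: evict from MRU metadata first, then shrink 'm' by what was
     * evicted before computing how much the MFU pass still needs to take.
     * Skipping the "m -= e" step makes the MFU pass try to evict the full
     * original difference a second time. */
    static uint64_t
    evict_metadata(uint64_t m, uint64_t w, uint64_t mru_meta, uint64_t mfu_meta)
    {
        uint64_t evicted = 0, e;

        e = MIN_U64(m > w ? m - w : 0, mru_meta);       /* MRU pass */
        evicted += e;
        m -= e;                                         /* the fix */

        e = MIN_U64(m > w ? m - w : 0, mfu_meta);       /* MFU pass */
        evicted += e;
        return (evicted);
    }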
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Theera K. <tkittich@hotmail.com>
Closes#16521
Closes#16546
On compression, we could be more explicit here for cases
where we cannot recompress the data.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes#9416
The ZLE compressor needs additional bytes to process
the d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it
until we rework the ZLE compressor somehow.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes#9416
Now the default compression is lz4, which can stop
the compression process by itself on incompressible data.
If there are additional size checks,
we will only make our compressratio worse.
New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.
The old 12.5% threshold is kept to minimize the effect
on existing user expectations of CPU utilization.
If data wasn't compressed, it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything,
we can just compress it with a zero threshold.
So, we don't need a new feature flag here!
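Expressed as a sketch (not the actual zio pipeline code; the
BPE_PAYLOAD_SIZE value is shown only for illustration), the decision to
keep a compressed copy becomes:

    #include <stdint.h>

    #define BPE_PAYLOAD_SIZE 112    /* illustrative; bytes embeddable in a BP */

    /* Sketch: keep the compressed result if it can be embedded in the
     * block pointer or if it saves at least one allocated sector;
     * otherwise store the block as ZIO_COMPRESS_OFF. */
    static int
    keep_compressed(uint64_t lsize, uint64_t psize, uint64_t ashift)
    {
        if (psize < BPE_PAYLOAD_SIZE)
            return (1);                             /* embedded_data */
        if (psize <= lsize - (1ULL << ashift))
            return (1);                             /* >= one sector saved */
        return (0);
    }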
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes#9416
For now, userspace has no znode implementation. Some of the property and
path handling code is used there though and is the same on all
platforms, so we only need a single copy of it.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes#16492
So that we can get actual benefit from the last commit.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#16131
Closes#16483
As mentioned in PR #16131, replacing the CRC-based hash with cityhash4
could slightly improve performance by eliminating memory accesses.
Replacing the algorithm is safe since the hash result is not persisted.
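To illustrate the point (this is in the spirit of cityhash4, not its
actual code): an arithmetic-only mix of four 64-bit inputs stays in
registers, whereas table-driven CRC variants perform a memory lookup per
byte.

    #include <stdint.h>

    /* Illustrative arithmetic-only mixing function: no lookup tables, so
     * no extra memory traffic. Safe to swap in only because, as above,
     * the hash result is never persisted on disk. */
    static uint64_t
    mix4(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
    {
        uint64_t h = a * 0x9E3779B97F4A7C15ULL;
        h ^= b + (h << 6) + (h >> 2);
        h ^= c + (h << 6) + (h >> 2);
        h ^= d + (h << 6) + (h >> 2);
        return (h);
    }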
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#16131
Closes#16483
zfs_dbgmsg() does not need a newline at the end of the message.
While there, slightly update/sync FreeBSD __dprintf().
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16536
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes
must, at a minimum, be PAGE_SIZE aligned. If they are not, EINVAL is
returned unless the direct property is set to always (see below).
For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write
request will take the normal (buffered) write path. If the request is
block aligned and there is a cached copy of the buffer in the ARC, the
cached copy will be discarded from the ARC, forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested
data is buffered (in the ARC), it will simply be copied from the ARC
into the user buffer.
For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
if the file's contents are mmap'ed. In this case, all requests that are
at least PAGE_SIZE aligned will just fall back to the buffered paths. If
the request is not PAGE_SIZE aligned, however, EINVAL will be returned
as always, regardless of whether the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that
differs between the operating systems supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection, so
any data in the user buffers that is written directly down to the
VDEV disks is guaranteed not to change. There is no data integrity
concern with O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity cannot be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls whether a
checksum verify is run on O_DIRECT writes to a top-level VDEV before
the contents of the I/O buffer are committed to disk. In the event of
a checksum verification failure, the write will return EIO. The number
of O_DIRECT write checksum verification errors can be observed by
running `zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issued as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup are not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met; otherwise, the request will be redirected through the
ARC. This property value will not allow a request to fail.
There is also a module parameter, zfs_dio_enabled, that can be used to
force all reads and writes through the ARC. Setting this module
parameter to 0 behaves as if the direct dataset property were set to
disabled.
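Putting the rules above into a sketch (not the actual ZFS code; names
and parameters here are illustrative):

    #include <errno.h>
    #include <stdint.h>

    enum dio_prop { DIO_DISABLED, DIO_STANDARD, DIO_ALWAYS };

    /* Sketch: decide whether an O_DIRECT request goes direct, falls back
     * to the buffered path (*go_direct = 0, return 0), or fails EINVAL. */
    static int
    route_odirect(uint64_t off, uint64_t len, uint64_t pagesize,
        uint64_t recordsize, enum dio_prop prop, int is_write,
        int is_mmapped, int *go_direct)
    {
        int page_aligned = (off % pagesize == 0) && (len % pagesize == 0);

        *go_direct = 0;
        if (prop == DIO_DISABLED)
            return (0);                     /* silently buffered */
        if (!page_aligned)
            return (prop == DIO_ALWAYS ? 0 : EINVAL);
        if (is_mmapped)
            return (0);                     /* buffered fallback */
        if (is_write && (off % recordsize != 0 || len % recordsize != 0))
            return (0);                     /* buffered write path */
        *go_direct = 1;
        return (0);
    }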
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes#10018