In the commit of the head_errlog feature we introduced a bug in
dsl_dataset_promote_sync(): we may dereference origin_head and hds, both
dereferencing ddpa after calling promote_sync() on ddpa.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#16272Closes#16273
At the end of l2arc_evict() fix an assertion in the case that l2ad_hand
+ distance == l2ad_end.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#16202Closes#16207
Depending on kind of error zap_expand_leaf() may return with or
without valid leaf reference held. Make sure it returns NULL if
due to error it has no leaf to return. Make its callers to check
the returned leaf pointer, and release the leaf if it is not NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#12366Closes#16159
Previous code overengineered cloned range calculation by using
BP_GET_LSIZE(). The problem is that legacy holes don't have the
logical size, so result will be wrong. But we also don't need
to look on every block size, since they all must be identical.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16165
Code for pools before version 11 uses dmu_objset_find_dp() to scan
for children datasets/clones. It calls enqueue_clones_cb() and
enqueue_cb() callbacks in parallel from multiple taskq threads.
It ends up bad for scan_ds_queue_insert(), corrupting scn_queue
AVL-tree. Fix it by introducing a mutex to protect those two
scan_ds_queue_insert() calls. All other calls are done from the
sync thread and so serialized.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16162
As I understand just for being less predictable dnode hash includes
8 bits of objset pointer, starting at 6. But since objset_t is
more than 1KB in size, its allocations are likely aligned to 2KB,
that means 11 lower bits provide no entropy. Just take the 8 bits
starting from 11.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16131
There is no reason for these module parameters to be read-only.
Being modified they just apply on next pool import/creation, that
is useful for testing different values.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16118
When compressed ARC is disabled, we may have to re-compress when
writing into L2ARC. If doing so we can't fit it into the original
physical size, we should just fail immediately, since even if it
may still fit into allocation size, its checksum will never match.
While there, refactor the code similar to other compression places
without using abd_return_buf_copy().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16038
In dbuf_read_verify_dnode_crypt():
- We don't need original dbuf locked there. Instead take a lock
on a dnode dbuf, that is actually manipulated.
- Block decryption for a dnode dbuf if it is currently being
written. ARC hash lock does not protect anonymous buffers, so
arc_untransform() is unsafe when used on buffers being written,
that may happen in case of encrypted dnode buffers, since they
are not copied by dbuf_dirty()/dbuf_hold_copy().
In dbuf_read():
- If the buffer is in flight, recheck its compression/encryption
status after it is cached, since it may need arc_untransform().
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16104
In P2ALIGN, the result would be incorrect when align is unsigned
integer and x is larger than max value of the type of align.
In that case, -(align) would be a positive integer, which means
high bits would be zero and finally stay zero after '&' when
align is converted to a larger integer type.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com>
Closes#15940
ZFS prefetch is currently governed by the zfs_prefetch_disable
tunable. However, this is a module-wide settings - if a specific
dataset benefits from prefetch, while others have issue with it,
an optimal solution does not exists.
This commit introduce the "prefetch" tri-state property, which enable
granular control (at dataset/volume level) for prefetching.
This patch does not remove the zfs_prefetch_disable, which remains
a system-wide switch for enable/disable prefetch. However, to avoid
duplication, it would be preferable to deprecate and then remove
the module tunable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15237Closes#15436
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.
Also allow zpool clear if no evidence of another host
is using the pool.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15839
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages to, as they provide
insight into what is happening
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes#15539
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.
This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags
are:
--allow-in-use
--allow-ashift-mismatch
--allow-replicaton-mismatch
The force flag will disable all of these checks.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes#15509
- Workaround dangling pointer in uu_list.c (#16124)
- Fix calloc() transposed arguments in zpool_vdev_os.c
- Make some temp variables unsigned to prevent triggering a
'-Werror=alloc-size-larger-than' error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#16124Closes#16125
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay). However, no action is taken by ZED,
like is done for checksum or I/O errors. This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
VDEV_PROP_SLOW_IO_N
VDEV_PROP_SLOW_IO_T
If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15469
Previous code held ARC state sublist lock throughout all L2ARC
write process, which included number of allocations and even ZIO
issues. Being blocked in any of those places the code could also
block ARC eviction, that could cause OOM activation or even dead-
lock if system is low on memory or one is too fragmented.
Fix it by dropping the lock as soon as we see a block eligible
for L2ARC writing and pick it up later using earlier inserted
marker. While there, also reduce scope of hash lock, moving
ZIO allocation and other operations not requiring header access
out of it. All operations requiring header access move under
hash lock, since L2_WRITING flag does not prevent header eviction
only transition to arc_l2c_only state with L1 header.
To be able to manipulate sublist lock and marker as needed add few
more multilist functions and modify one.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16040
When after #16022 adding new range we aggregate more than two
existing ranges, that should be very rare, only if several streams
overlap, we may need to zero not the last range, but some earlier.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16072
Syncing context should not depend on current state of dbuf, which
could already change several times in later transaction groups,
but rely solely on dirty record for the transaction group being
synced. Some of the checks seem already impossible, while instead
of others I think we should better check for absence of data in
the specific dirty record rather than DB_NOFILL.
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16057
Before this change speculative prefetcher was able to detect a stream
only if all of its accesses are perfectly sequential. It was easy to
implement and is perfectly fine for single-threaded applications.
Unfortunately multi-threaded network servers, such as iSCSI, SMB or
NFS usually have plenty of threads and may often reorder requests,
preventing successful speculation and prefetch.
This change allows speculative prefetcher to detect streams even if
requests are reordered by introducing a list of 9 non-contiguous
ranges up to 16MB ahead of current stream position and filling the
gaps as more requests arrive. It also allows stream to proceed
even with holes up to a certain configurable threshold (25%).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16022
Investigating read errors triggering panic fixed in #16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_UNCACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_UNCACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Robert Evans <evansr@google.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16052
Previous code reported non-ZIO errors only via return value, but
not via parent ZIO. It could cause NULL-dereference panics due
to dmu_buf_hold_array_by_dnode() ignoring the return value,
relying solely on parent ZIO status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reported by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16042
- When reading L0 block pointers handle buffers without ones and
without dirty records as a holes. Those appear when dnode size
was increased, but the end was never written, so there are no new
indirection levels to store the pointers. It makes no sense to
return EAGAIN here, since sync won't create new indirection levels
until there will be actual writes.
- When cloning blocks set destination hole logical birth time
to the current TXG. Otherwise if we are cloning over existing
data, newly created holes may not be properly replicated later.
Use BP_SET_BIRTH() when possible to not replicate its logic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15994Closes#16007
Unlike DDT, where ZAP values may have different lengths due to
compression, all BRT entries are identical 8-byte counters. It
does not make sense to first fetch the length only to assert it.
zap_lookup_uint64() is specifically designed to work with counters
of different size and should return error if something odd found.
Calling it straight allows to save some measurable CPU time.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15950
Similar to DDT make BRT data and indirect block sizes configurable
via module parameters. I am not sure what would be the best yet,
but similar to DDT 4KB blocks kill all chances of compression on
vdev with ashift=12 or more, that on my tests reaches 3x.
While here, fix documentation for respective DDT parameters.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15967
Since brt_pending_apply() is running in syncing context, no other
brt_pending_tree accesses are possible for the TXG. We don't need
to acquire brt_pending_lock here.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15955
Before this change ZAP called dnode_hold() for almost every block
access, that was clearly visible in profiler under heavy load, such
as BRT. This patch makes it always hold the dnode reference between
zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode
operations between those. It also adds several new _by_dnode() APIs
to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode()
variant and uses it in the ZAP code.
After this there remains only one call to dmu_buf_dnode_enter(),
which seems to be unneeded. So remove the call and the functions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15951
If there is a pending entry for this block, then we've already
issued BRT prefetch for it within this TXG, so don't do it again.
BRT vdev lookup and following zap_prefetch_uint64() call can be
pretty expensive and should be avoided when not necessary.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15941
- Remove custom zap_memset(), use regular memset().
- Use PANIC() instead of opaque cmn_err(CE_PANIC).
- Provide entry parameter to zap_leaf_rehash_entry().
- Reduce branching in zap_leaf_array_create() inner loop.
- Remove signedness where it should not be.
Should be no function changes.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15976
It does not look important how exactly brt_pending_tree is sorted.
When cloning large file, it is quite likely that all of its blocks
have identical physical birth times, so comparing them first does
not provide useful entropy, while accesses additional cache line.
In most cases combination of vdev and offset provides unique result
and physical birth time comparison is not even needed. Meanwhile,
when traversing the tree inside brt_pending_apply(), it can be
beneficial for dbuf cache and CPU cache hits to group processing
by vdev and so by the per-VDEV BRT ZAPs.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15954
Before this change resume token was updated only on data receive.
Usually it is enough to resume replication without much overlap.
But we've got a report of a curios case, where replication source
was traversed with recursive grep, which through enabled atime
modified every object without modifying any data. It produced
several gigabytes of replication traffic without a single data
write and so without a single resume point.
While the resume token was not designed to resume from an object,
I've found that the send implementation always sends object before
any data. So by requesting resume from offset 0 we are effectively
resuming from the object, followed (or not) by the data at offset
0, just as we need it.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15927
- Split dmu_prefetch_dnode() from dmu_prefetch() into a separate
function. It is quite inconvenient to read the code where len = 0
means dnode prefetch instead indirect/data prefetch. One function
doing both has no benefits, since the code paths are independent.
- Improve dmu_prefetch() handling of long block ranges. Instead
of limiting L0 data length to prefetch for to dmu_prefetch_max,
make dmu_prefetch_max limit the actual amount of prefetch at the
specified level, and, if there is more, prefetch all the rest at
higher indirection level. It should improve random access times
within the prefetched range of any length, reducing importance of
specific dmu_prefetch_max value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15076
Track history in context of bursts, not individual log blocks. It
allows to not blow away all the history by single large burst of
many block, and same time allows optimizations covering multiple
blocks in a burst and even predicted following burst. For each
burst account its optimal block size and minimal first block size.
Use that statistics from the last 8 bursts to predict first block
size of the next burst.
Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now. With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15635
- Generalize vdev_nowritecache handling by traversing through the
VDEV tree and skipping children ZIOs where not supported.
- Remove intermediate zio_null() in case of several VDEV children.
- Remove children handling from zio_ioctl(). There are no other
use cases for this code beside DKIOCFLUSHWRITECACHED, and would there
be, I doubt they would so straightforward apply to all VDEV children.
Comparing to removed previous optimization this should improve cases
of redundant ZILs/SLOGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15515
... by checking that previous block is fully written and flushed.
It allows to skip commit delays since we can give up on aggregation
in that case. This removes zil_min_commit_timeout parameter, since
for single-threaded workloads it is not needed at all, while on very
fast devices even some multi-threaded workloads may get detected as
single-threaded and still bypass the wait. To give multi-threaded
workloads more aggregation chances increase zfs_commit_timeout_pct
from 5 to 10%, as they should suffer less from additional latency.
Also single-threaded workloads detection allows in perspective better
prediction of the next block size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15381
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.
This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#15933Closes#16019
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.
This adds adb_iterate_page_func, which yields unmapped struct page
instead.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#15533Closes#15588
(cherry picked from commit 390b448726)
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15665
Similar to deduplication, the size of data duplicated by block cloning
should not be included in the slop space calculation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com>
Closes#15874
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error. The range may be for an entire file. While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.
As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.
Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning. This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned. This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller. However, the destination file range is left in an undefined
state.
A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15728Closes#15842
During device removal stress tests, we noticed that we were tripping
the assertion that mg_initialized was true. After investigation, it was
determined that the mg in question was the embedded log metaslab
group for a newly added vdev; the normal mg had been initialized (by
metaslab_sync_reassess, via vdev_sync_done). However, because the spa
config alloc lock is not held as writer across both calls to
metaslab_sync_reassess, it is possible for an allocation to happen
between the two metaslab_groups being initialized. Because the metaslab
code doesn't check the group in question, just the vdev's main mg, it
is possible to get past the initial check in vdev_allocatable and
later fail due to the assertion.
We simply remove the assertions. We could also consider locking the
ALLOC lock around the reassess calls in vdev_sync_done, but that risks
deadlocks. We could check the actual target mg in vdev_allocatable,
but that risks racing with a passivation that comes in after that
check but before the assertion. We still won't be able to actually
allocate from the metaslab group if no metaslabs are ready, so this
change shouldn't break anything.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15818
If devid or physpath for a vdev changes between imports, ensure it is
updated to the new value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15816
Pool import logic uses vdev paths, so it makes sense to add path
information on AUX vdev as well.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15737
When spare or l2cache (aux) vdev is added during pool creation,
spa->spa_uberblock is not dumped until that point. Subsequently,
the aux label is never synchronized after its initial creation,
resulting in the uberblock label remaining undumped. The uberblock
is crucial for lib_blkid in identifying the ZFS partition type. To
address this issue, we now ensure sync of the uberblock label once
if it's not dumped initially.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15737
In general, VOPs must not load the "z_log" field until having called
zfs_enter_verify_zp().
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15752
If the destination file is mmaped and the mmaped region was already
read, so it is cached, we need to update mmaped pages after successful
clone using update_pages().
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Pointed out by: Ka Ho Ng <khng@freebsd.org>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#15772
Switch from cv_wait() to cv_wait_idle() in vdev_autotrim_wait_kick(),
which should mitigate the high load average while waiting.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes#15781
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15719