On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error. The range may be for an entire file. While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.
As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.
Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning. This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned. This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller. However, the destination file range is left in an undefined
state.
A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15728Closes#15842
When spare or l2cache (aux) vdev is added during pool creation,
spa->spa_uberblock is not dumped until that point. Subsequently,
the aux label is never synchronized after its initial creation,
resulting in the uberblock label remaining undumped. The uberblock
is crucial for lib_blkid in identifying the ZFS partition type. To
address this issue, we now ensure sync of the uberblock label once
if it's not dumped initially.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15737
Track history in context of bursts, not individual log blocks. It
allows to not blow away all the history by single large burst of
many block, and same time allows optimizations covering multiple
blocks in a burst and even predicted following burst. For each
burst account its optimal block size and minimal first block size.
Use that statistics from the last 8 bursts to predict first block
size of the next burst.
Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now. With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15635
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15665
While evicting dbufs of a dnode, a marker node is added to the AVL.
The marker node should be inserted in AVL tree ahead of the dbuf its
trying to delete. The blkid and level is used to ensure this. However,
this could go wrong there's another dbufs with the same blkid and level
in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare()
could fail to give the right location or could cause confusion and
trigger ASSERTs.
To ensure that the marker is inserted before the deleting dbuf, use
the pointer value of the original dbuf for comparision.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#12482Closes#15643
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages to, as they provide
insight into what is happening
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes#15539
When two datasets share the same master encryption key, it is safe
to clone encrypted blocks. Currently only snapshots and clones
of a dataset share with it the same encryption key.
Added a test for:
- Clone from encrypted sibling to encrypted sibling with
non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Kay Pedersen <mail@mkwg.de>
Closes#15544
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay. Later zil_free_clone_range() drops them after replay or on
dataset destroy. The total balance is neutral. It means on actual
replay we must take additional references, which would stay in BRT.
Without this blocks could be freed prematurely when either original
file or its clone are destroyed. I've observed BRT being emptied
and the feature being deactivated after ZIL replay completion, which
should not have happened. With the patch I see expected stats.
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15603
So that zdb (and others!) can get at the BRT on-disk structures.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15541
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.
This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.
This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.
metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15511
This gets around UBSAN errors when using arrays at the end of
structs. It converts some zero-length arrays to variable length
arrays and disables UBSAN checking on certain modules.
It is based off of the patch from #15460.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Co-authored-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #15145Closes#15510
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes#15022
Add a dataset_kstats_rename function, and call it when renaming
a zvol on FreeBSD and Linux.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15482Closes#15486
Currently vdev_queue_class_length is responsible for checking how long
the queue length is, however, it doesn't check the length when a list
is used, rather it just returns whether it is empty or not. To fix this
I added a counter variable to vdev_queue_class to keep track of the sync
IO ops, and changed vdev_queue_class_length to reference this variable
instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes#15478
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).
There are two problems:
1. Each ZVOL in a pool is a separate dataset/objset having a single
dnode. This means the objsets are synchronized serially, which
leads to a bottleneck of ~330K blocks written per second per pool.
2. In the case of multiple dirty dnodes/files on a dataset/objset on a
big system they will be sync'd in parallel taskq threads. However,
it is inefficient to to use 75% of CPU cores of a big system to do
that, because of (a) bottlenecks on a single write issue taskq, and
(b) allocation throttling. In addition, if not for the allocation
throttling sorting write requests by bookmarks (logical address),
writes for different files may reach space allocators interleaved,
leading to unwanted fragmentation.
The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15197
The change in the zvol readonly property does not update the block
device readonly entry until the first IO to the ZVOL. This patch
addresses the issue by updating the block device readonly property
from the set property IOCTL call.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
Currently, zvol threading can be switched through the zvol_request_sync
module parameter system-wide. By making it a zvol property, zvol
threading can be switched per zvol.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
zvol_set_volmode() and zvol_set_snapdev() share a common code path.
Merge this shared code path into zvol_set_common().
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
There is no sense to have separate implementations for FreeBSD and
Linux. Make Linux code shared as more functional and just register
FreeBSD-specific prune callback with arc_add_prune_callback() API.
Aside of code cleanup this should fix excessive pruning on FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15456
RAIDZ parity is calculated by adding data one column at a time. It
works OK for small blocks, but for large blocks results of previous
addition may already be evicted from CPU caches to main memory, and
in addition to extra memory write require extra read to get it back.
This patch splits large parity operations into 64KB chunks, that
should in most cases fit into CPU L2 caches from the last decade.
I haven't touched more complicated cases of data reconstruction to
not over complicate the code. Those should be relatively rare.
My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show
up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/
RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache
should see the effect even on smaller blocks. Wider vdevs may need
bigger blocks to be affected.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15448
ZVOL:
- Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only
one object, it makes no sense to maintain async queue and on each
commit merge it into sync. Single sync queue is just cheaper, while
it changes nothing until actual commit request arrives.
- Remove zsd_sync_cnt and the zil_async_to_sync() calls since we
are no longer switching between sync and async queues.
ZFS:
- Mark write transactions as sync based only on number of sync
opens (z_sync_cnt). We can not randomly jump between sync and
async unless we want data corruptions due to writes reordering.
- When file first opened with O_SYNC (z_sync_cnt incremented to 1)
call zil_async_to_sync() for it to preserve correct ordering between
past and future writes.
- Drop zfs_fsyncer_key logic. Looks like it was an optimization
for workloads heavily intermixing async writes with tons of fsyncs.
But first it was broken 8 years ago due to Linux tsd implementation
not allowing data storage between syscalls, and second, I doubt it
is safe to switch from async to sync so often and without calling
zil_async_to_sync().
- Rename sync argument of *_log_write() into commit, now only
signalling caller's intent to call zil_commit() soon after. It
allows WR_COPIED optimizations without extra other meanings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15366
Many long-running ZFS ioctls lock the spa_namespace_lock, forcing
concurrent ioctls to sleep for the mutex. Previously, the only
option is to call mutex_enter() which sleeps uninterruptibly. This
is a usability issue for sysadmins, for example, if the admin runs
`zpool status` while a slow `zpool import` is ongoing, the admin's
shell will be locked in uninterruptible sleep for a long time.
This patch resolves this admin usability issue by introducing
mutex_enter_interruptible() which sleeps interruptibly while waiting
to acquire a lock. It is implemented for both Linux and FreeBSD.
The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to
use this new macro so that the command can be interrupted if it is
issued during a concurrent `zpool import` (or other long-running
operation).
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Thomas Bertschinger <bertschinger@lanl.gov>
Closes#15360
zio_root() has no arguments for ready callback or parent ZIO. Except
one recent case in ZIL code if root ZIOs ever have a parent it is
also a root ZIO. It means we do not need READY pipeline stage for
them, which takes some time to process, but even more time to wait
for the children and be woken by them, and both for no good reason.
The most visible effect of this change is that it avoids one taskq
wakeup per ZIL block written, previously used to run zio_ready()
for lwb_root_zio and skipped now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15398
... by checking that previous block is fully written and flushed.
It allows to skip commit delays since we can give up on aggregation
in that case. This removes zil_min_commit_timeout parameter, since
for single-threaded workloads it is not needed at all, while on very
fast devices even some multi-threaded workloads may get detected as
single-threaded and still bypass the wait. To give multi-threaded
workloads more aggregation chances increase zfs_commit_timeout_pct
from 5 to 10%, as they should suffer less from additional latency.
Also single-threaded workloads detection allows in perspective better
prediction of the next block size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15381
ZFS prefetch is currently governed by the zfs_prefetch_disable
tunable. However, this is a module-wide settings - if a specific
dataset benefits from prefetch, while others have issue with it,
an optimal solution does not exists.
This commit introduce the "prefetch" tri-state property, which enable
granular control (at dataset/volume level) for prefetching.
This patch does not remove the zfs_prefetch_disable, which remains
a system-wide switch for enable/disable prefetch. However, to avoid
duplication, it would be preferable to deprecate and then remove
the module tunable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15237Closes#15436
We already use ____cacheline_aligned in many places, so add one more
instead of seems arbitrary char tc_pad[8].
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15402
This includes random small tweaks, primarily a build fixes, required
when ZFS is built as part of FreeBSD base.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15368
Before this change ZFS created threads for 50% of CPUs for each top-
level vdev. Plus it created the same number of threads for embedded
log groups (that have only one metaslab and don't need any preload).
As result, on system with 80 CPUs and pool of 60 vdevs this resulted
in 4800 metaslab preload threads, that is absolutely insane.
This patch changes the preload threads to 50% of CPUs in one taskq
per pool, so on the mentioned system it will be only 40 threads.
Among other things this fixes zdb on the mentioned system and pool
on FreeBSD, that failed to create so many threads in one process.
Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15319
In most cases we do not care about exact number of buffers linked
to the header, we just need to know if it is zero, non-zero or one.
That can easily be checked just looking on b_buf pointer or in some
cases derefencing it.
b_ebufcnt is read only once, and in that case we already traverse
the list as part of arc_buf_remove(), so second traverse should not
be expensive.
This reduces L1 ARC header size by 8 bytes and full crypto header by
16 bytes, down to 176 and 232 bytes on FreeBSD respectively.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15350
Earlier as part of #14123 I've removed one use of b_cv. This patch
reuses the same approach to remove the other one from much more
rare code path.
This saves 16 bytes of L1 ARC header on FreeBSD (reducing it from
200 to 184 bytes) and seems even more on Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15340
ZFS historically has had several space allocators that were
dynamically selectable. While these have been retained in
OpenZFS, only a single allocator has been statically compiled
in. This patch compiles all allocators for OpenZFS and provides
a module parameter to allow for manual selection between them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15218
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be use for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15018
The previous patch #14841 appeared to have significant flaw, causing
deadlocks if zl_get_data callback got blocked waiting for TXG sync. I
already handled some of such cases in the original patch, but issue
#14982 shown cases that were impossible to solve in that design.
This patch fixes the problem by postponing log blocks allocation till
the very end, just before the zios issue, leaving nothing blocking after
that point to cause deadlocks. Before that point though any sleeps are
now allowed, not causing sync thread blockage. This require slightly
more complicated lwb state machine to allocate blocks and issue zios
in proper order. But with removal of special early issue workarounds
the new code is much cleaner now, and should even be more efficient.
Since this patch uses null zios between write, I've found that null
zios do not wait for logical children ready status in zio_ready(),
that makes parent write to proceed prematurely, producing incorrect
log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children()
fixes it.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15122
In most cases dmu_sync() works with dirty records directly and does
not need actual data. The only exception is dmu_sync_late_arrival().
To save some CPU time use dmu_buf_hold_noread*() in z*_get_data()
and explicitly call dbuf_read() in dmu_sync_late_arrival(). There
is also a chance that by that time TXG will already be synced and
we won't have to do it at all.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15153
- Split dmu_prefetch_dnode() from dmu_prefetch() into a separate
function. It is quite inconvenient to read the code where len = 0
means dnode prefetch instead indirect/data prefetch. One function
doing both has no benefits, since the code paths are independent.
- Improve dmu_prefetch() handling of long block ranges. Instead
of limiting L0 data length to prefetch for to dmu_prefetch_max,
make dmu_prefetch_max limit the actual amount of prefetch at the
specified level, and, if there is more, prefetch all the rest at
higher indirection level. It should improve random access times
within the prefetched range of any length, reducing importance of
specific dmu_prefetch_max value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15076
This gives `zdb -b` support for clone blocks.
Previously, it didn't know what clones were, so would count their space
allocation multiple times and then report leaked space (or, in debug,
would assert trying to claim blocks a second time).
This commit fixes those bugs, and reports the number of clones and the
space "used" (saved) by them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15123
Fastwrite was introduced many years ago to improve ZIL writes spread
between multiple top-level vdevs by tracking number of allocated but
not written blocks and choosing vdev with smaller count. It suposed
to reduce ZIL knowledge about allocation, but actually made ZIL to
even more actively report allocation code about the allocations,
complicating both ZIL and metaslabs code.
On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync()
was lost many years ago, that was one of the declared benefits. Plus
introduction of embedded log metaslab class solved another problem
with allocation rotor accounting both normal and log allocations,
since in most cases those are now in different metaslab classes.
After all that, I'd prefer to simplify already too complicated ZIL,
ZIO and metaslab code if the benefit of complexity is not obvious.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15107
- Reduce maximum prefetch distance for 32bit platforms to 8MB as it
was previously. Those systems didn't grow much probably, so better
stay conservative there.
- Retire array_rd_sz tunable, blocking prefetch for large requests.
We should not penalize applications trying to be more efficient. The
speculative prefetcher by itself has reasonable distance limits, and
1MB is not much at all these days.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15072
To simplify error handling bpobj_iterate_blkptrs() iterates through
the list of block pointers backwards. Unfortunately speculative
prefetcher is currently unable to detect such patterns, that makes
each block read there synchronous and very slow on HDD pools.
According to my tests, added explicit prefetch reduces time needed
to asynchronously delete 8 snapshots of 4 million blocks each from
20 seconds to less than one, that should free sync thread for other
useful work, such as async writes, scrub, etc.
While there, plug one memory leak in case of bpobj_open() error and
harmonize some variable names.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15071
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely. So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match". The harm in sending them is simply that they bloat the
event. In particular, on FreeBSD the event must fit into a 1016 byte
buffer.
Fixes#14717 for mirrored pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#14717Closes#15052
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete. With modern storage hardware, they will
almost always look like white noise; all bits will be wrong. They only
serve to bloat the event. That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.
This fixes issue #14717 for RAIDZ pools, but not for mirror pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15052
Since spa_min_alloc may not be a power of 2, unlike ashifts, in the
case of DRAID, we should not select the minimal value among several
vdevs. Rounding to a multiple of it is unlikely to work for other
vdevs. Instead, using the greatest common divisor produces smaller
yet more reasonable results.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15067
Ashift can be set for a vdev only during its creation, and the
top-level vdev does not change when a vdev is attached or replaced.
The ashift property should not be used during attachment, as it
does not allow attaching/replacing a vdev if the pool's ashift
property is increased after the existing vdev was created. Instead,
we should be able to attach the vdev if the attached vdev can
satisfy the ashift requirement with its parent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15061
As it turns out having autotrim default to 'on' on FreeBSD never really
worked due to mess with defines where userland and kernel module were
getting different default values (userland was defaulting to 'off',
module was thinking it's 'on').
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15079
- Pack struct zio_prop by 4 bytes from 84 to 80.
- Skip new child ZIO locking while linking to parent. The newly
allocated ZIO is not externally visible yet, so nobody should care.
- Skip io_bp_copy writes when not used (write && non-debug).
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14985
Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O. This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed. While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.
This should partially address the same issue as c85ac731a0, but
from a different perspective, complementing it.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15007
Consistently get the proper default value for autotrim.
Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs get the wrong default value, leading to confusion and
incorrect output when autotrim value was not set explicitly.
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15016
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists. AVL-trees are too expensive
for such a simple task. To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.
To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue. Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member. Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().
Introduce deadline of ~0.5s for LBA-sorted queues. Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation. With this change I no longer see it. I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal. For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
Arrange members in struct zio to access only one cache line when
searching through vdev queues. While there, remove io_alloc_node,
reusing the io_queue_node instead. Those two are never used same
time.
Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests. Just remove the tree.
Remove locking from txg_all_lists_empty(). It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.
With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14925