Private read/write mapping can't be used to modify the mapped files, so
they will remain be immutable. Private read/write mappings are usually
used to load the data segment of executable files, rejecting them will
rendering immutable executable files to stop working.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes#15344
As shown in #15404#issuecomment-1765002181, Ubuntu kernel has
'Provides: zfs-dkms', which will cause uninstall of the kernel, when
attempting to install openzfs-zfs-dkms.
As a workaround remove the 'Conflicts: zfs-dkms' definition from
the debian control file.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mart Frauenlob <AllKind@fastest.cc>
Closes#15503
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes#15022
With Linux v6.6.0 and GCC 12, when debug build is configured,
implicit conversion error is raised while converting
'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of
'true' to fix the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15489
Add a ZFS feature flag to indicate OpenZFS availability.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gordon Tetlow <gordon@freebsd.org>
Closes#15484
When retrieving a system attribute, the size of the supplied
buffer is ignored. If the buffer is too small to hold the attribute,
sa_attr_op() will write past the end of the buffer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jking@racktopsystems.com>
Closes#15476
PR#15459 add all read-only compatible zpool features to grub2
compatibility list. 'obsolete_counts' is a read-only features that
depends on 'device_removal' feature which is not read-only and
is marked as ZFEATURE_FLAG_MOS. Creating a pool with grub2
compatibility enables 'device_removal' feature as well, which is
not desired.
This commit removes the 'obsolete_counts' feature from
grub2 compatibility list, as GRUB only supports read-only
compatible features.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15499
Previously taskq_init_ent() was an empty macro, while actual init
was done by taskq_dispatch_ent(). It could be slightly faster in
case taskq never enqueued. But without it taskq_empty_ent() relied
on the structure being zeroed by somebody else, that is not good.
As a side effect this allows the same task to be queued several
times, that is normal on FreeBSD, that may or may not get useful
here also one day.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15455
- Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl
output.
- Use much faster sbuf_cat() instead of sbuf_printf("%s").
Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes
to seconds, making dbufstat almost usable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15495
Add a dataset_kstats_rename function, and call it when renaming
a zvol on FreeBSD and Linux.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15482Closes#15486
Alien does not honour the %posttrans hook.
So move the dkms uninstall/install scripts to the
%pre/%post hooks in case of package install/upgrade.
In case of package removal, handle that in %preun.
Add removal of all old dkms modules.
Add checking for broken 'dkms status'. Handle that as
good as possible and warn the user about it.
Also add more verbose messages about what we are doing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mart Frauenlob <AllKind@fastest.cc>
Closes#15415
Otherwise callbacks may trigger KMSAN violations in the dlen == 0 case.
For example, raidz_syn_pq_abd() will compare an uninitialized pointer
with itself before returning. This seems harmless, but let's maintain
good hygiene and avoid passing uninitialized variables, if only to
placate KMSAN.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15491
There have been rare cases where the VDEV_ENC_SYSFS_PATH value that zed
gets passed is stale. To mitigate this, dynamically check the sysfs
path at the time of zed event processing, and use the dynamic value if
possible. Note that there will be other times when we can not
dynamically detect the sysfs path (like if a disk disappears) and have
to rely on the old value for things like turning on the fault LED. That
is to say, we can't just blindly use the dynamic path in every case.
Also:
- Add enclosure sysfs entry when running 'zpool add'
- Fix 'slot' and 'enc' zpool.d scripts for nvme
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15462
Currently vdev_queue_class_length is responsible for checking how long
the queue length is, however, it doesn't check the length when a list
is used, rather it just returns whether it is empty or not. To fix this
I added a counter variable to vdev_queue_class to keep track of the sync
IO ops, and changed vdev_queue_class_length to reference this variable
instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Closes#15478
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).
There are two problems:
1. Each ZVOL in a pool is a separate dataset/objset having a single
dnode. This means the objsets are synchronized serially, which
leads to a bottleneck of ~330K blocks written per second per pool.
2. In the case of multiple dirty dnodes/files on a dataset/objset on a
big system they will be sync'd in parallel taskq threads. However,
it is inefficient to to use 75% of CPU cores of a big system to do
that, because of (a) bottlenecks on a single write issue taskq, and
(b) allocation throttling. In addition, if not for the allocation
throttling sorting write requests by bookmarks (logical address),
writes for different files may reach space allocators interleaved,
leading to unwanted fragmentation.
The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15197
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Closes#15470
Calling nfs_reset_shares on Linux prints a warning:
`failed to lock /etc/exports.d/zfs.exports.lock: No such file or
directory`
when /etc/exports.d does not exist. The directory gets created, when a
filesystem is actually exported through nfs_toggle_share and
nfs_init_share. The truncation of /etc/exports.d/zfs.exports happens
unconditionally when calling `zfs mount -a` (via zfs_do_mount and
share_mount in `cmd/zfs/zfs_main.c`).
Fixing the issue only in the Linux part, since the exports file on
freebsd is in `/etc/zfs/`, which seems present on 2 FreeBSD systems I
have access to (through `/etc/zfs/compatibility.d/`), while a Debian
box does not have the directory even if `/usr/sbin/exportfs` is
present through the `nfs-kernel-server` package.
The code for exports_available is copied from nfs_available above.
Fixes: ede037cda7
("Make zfs-share service resilient to stale exports")
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes#15369Closes#15468
Block cloning from an encrypted dataset into an unencrypted dataset
and vice versa is not possible. The current code did allow cloning
unencrypted files into an encrypted dataset causing a panic when
these were accessed. Block cloning between encrypted and encrypted
is currently supported on the same filesystem only.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Rob N <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#15464Closes#15465
GRUB opens the boot pool in read-only mode. All read-only
compatible features for zpool can be enabled and added to
grub2 compatibility, as GRUB does not open the boot-pool
for write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15459
The change in the zvol readonly property does not update the block
device readonly entry until the first IO to the ZVOL. This patch
addresses the issue by updating the block device readonly property
from the set property IOCTL call.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
Currently, zvol threading can be switched through the zvol_request_sync
module parameter system-wide. By making it a zvol property, zvol
threading can be switched per zvol.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
zvol_set_volmode() and zvol_set_snapdev() share a common code path.
Merge this shared code path into zvol_set_common().
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
There is no sense to have separate implementations for FreeBSD and
Linux. Make Linux code shared as more functional and just register
FreeBSD-specific prune callback with arc_add_prune_callback() API.
Aside of code cleanup this should fix excessive pruning on FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15456
We should not always use PAGESIZE alignment for caches bigger than
it and SPA_MINBLOCKSIZE otherwise. Doing that caches for 5, 6, 7,
10 and 14KB rounded up to 8, 12 and 16KB respectively make no sense.
Instead specify as alignment the biggest power-of-2 divisor. This
way 2KB and 6KB caches are both aligned to 2KB, while 4KB and 8KB
are aligned to 4KB.
Reduce number of caches to half-power of 2 instead of quarter-power
of 2. This removes caches difficult for underlying allocators to
fit into page-granular slabs, such as: 2.5, 3.5, 5, 7, 10KB, etc.
Since these caches are mostly used for transient allocations like
ZIOs and small DBUF cache it does not worth being too aggressive.
Due to the above alignment issue some of those caches were not
working properly any way. 6KB cache now finally has a chance to
work right, placing 2 buffers into 3 pages, that makes sense.
Remove explicit alignment in Linux user-space case. I don't think
it should be needed any more with the above fixes.
As result on FreeBSD instead of such numbers of pages per slab:
vm.uma.zio_buf_comb_16384.keg.ppera: 4
vm.uma.zio_buf_comb_14336.keg.ppera: 4
vm.uma.zio_buf_comb_12288.keg.ppera: 3
vm.uma.zio_buf_comb_10240.keg.ppera: 3
vm.uma.zio_buf_comb_8192.keg.ppera: 2
vm.uma.zio_buf_comb_7168.keg.ppera: 2
vm.uma.zio_buf_comb_6144.keg.ppera: 2 <= Broken
vm.uma.zio_buf_comb_5120.keg.ppera: 2
vm.uma.zio_buf_comb_4096.keg.ppera: 1
vm.uma.zio_buf_comb_3584.keg.ppera: 7 <= Hard to free
vm.uma.zio_buf_comb_3072.keg.ppera: 3
vm.uma.zio_buf_comb_2560.keg.ppera: 2
vm.uma.zio_buf_comb_2048.keg.ppera: 1
vm.uma.zio_buf_comb_1536.keg.ppera: 2
vm.uma.zio_buf_comb_1024.keg.ppera: 1
vm.uma.zio_buf_comb_512.keg.ppera: 1
I am now getting such:
vm.uma.zio_buf_comb_16384.keg.ppera: 4
vm.uma.zio_buf_comb_12288.keg.ppera: 3
vm.uma.zio_buf_comb_8192.keg.ppera: 2
vm.uma.zio_buf_comb_6144.keg.ppera: 3 <= Fixed, 2 in 3 pages
vm.uma.zio_buf_comb_4096.keg.ppera: 1
vm.uma.zio_buf_comb_3072.keg.ppera: 3
vm.uma.zio_buf_comb_2048.keg.ppera: 1
vm.uma.zio_buf_comb_1536.keg.ppera: 2
vm.uma.zio_buf_comb_1024.keg.ppera: 1
vm.uma.zio_buf_comb_512.keg.ppera: 1
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15452
RAIDZ parity is calculated by adding data one column at a time. It
works OK for small blocks, but for large blocks results of previous
addition may already be evicted from CPU caches to main memory, and
in addition to extra memory write require extra read to get it back.
This patch splits large parity operations into 64KB chunks, that
should in most cases fit into CPU L2 caches from the last decade.
I haven't touched more complicated cases of data reconstruction to
not over complicate the code. Those should be relatively rare.
My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show
up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/
RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache
should see the effect even on smaller blocks. Wider vdevs may need
bigger blocks to be affected.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15448
ZVOL:
- Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only
one object, it makes no sense to maintain async queue and on each
commit merge it into sync. Single sync queue is just cheaper, while
it changes nothing until actual commit request arrives.
- Remove zsd_sync_cnt and the zil_async_to_sync() calls since we
are no longer switching between sync and async queues.
ZFS:
- Mark write transactions as sync based only on number of sync
opens (z_sync_cnt). We can not randomly jump between sync and
async unless we want data corruptions due to writes reordering.
- When file first opened with O_SYNC (z_sync_cnt incremented to 1)
call zil_async_to_sync() for it to preserve correct ordering between
past and future writes.
- Drop zfs_fsyncer_key logic. Looks like it was an optimization
for workloads heavily intermixing async writes with tons of fsyncs.
But first it was broken 8 years ago due to Linux tsd implementation
not allowing data storage between syscalls, and second, I doubt it
is safe to switch from async to sync so often and without calling
zil_async_to_sync().
- Rename sync argument of *_log_write() into commit, now only
signalling caller's intent to call zil_commit() soon after. It
allows WR_COPIED optimizations without extra other meanings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15366
Prefetched buffers are currently read from L2ARC if, and only if,
l2arc_noprefetch is set to non-default value of 0. This means that
a streaming read which can be served from L2ARC will instead engage
the main pool.
For example, consider what happens when a file is sequentially read:
- application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling
application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains
the required buffers;
- main pool has to suffer another sequential read load, which (due to
most NCQ-enabled HDDs preferring sequential loads) can dramatically
increase latency for uncached random reads.
In other words, current behavior is to write data to L2ARC (wearing it)
without using the very same cache when reading back the same data. This
was probably useful many years ago to preserve L2ARC read bandwidth but,
with current SSD speed/size/price, it is vastly sub-optimal.
Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads,
means that even prefetched but unused buffers will be copied into L2ARC,
further increasing wear and load for potentially not-useful data.
This patch enable prefetched buffer to be read from L2ARC even when
l2arc_noprefetch=1 (default), increasing sequential read speed and
reducing load on the main pool without polluting L2ARC with not-useful
(ie: unused) prefetched data. Moreover, it clear users confusion about
L2ARC size increasing but not serving any IO when doing sequential
reads.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15451
Many long-running ZFS ioctls lock the spa_namespace_lock, forcing
concurrent ioctls to sleep for the mutex. Previously, the only
option is to call mutex_enter() which sleeps uninterruptibly. This
is a usability issue for sysadmins, for example, if the admin runs
`zpool status` while a slow `zpool import` is ongoing, the admin's
shell will be locked in uninterruptible sleep for a long time.
This patch resolves this admin usability issue by introducing
mutex_enter_interruptible() which sleeps interruptibly while waiting
to acquire a lock. It is implemented for both Linux and FreeBSD.
The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to
use this new macro so that the command can be interrupted if it is
issued during a concurrent `zpool import` (or other long-running
operation).
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Thomas Bertschinger <bertschinger@lanl.gov>
Closes#15360
Commits 518b487 and 23bdb07 changed the default ARC size limit on
Linux systems to 1/2 of physical memory, which has become too
strict for modern systems with large amounts of RAM. This patch
changes the default limit to match that of FreeBSD, so ZFS may
have a unified value on both platforms.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes#15437
zio_root() has no arguments for ready callback or parent ZIO. Except
one recent case in ZIL code if root ZIOs ever have a parent it is
also a root ZIO. It means we do not need READY pipeline stage for
them, which takes some time to process, but even more time to wait
for the children and be woken by them, and both for no good reason.
The most visible effect of this change is that it avoids one taskq
wakeup per ZIL block written, previously used to run zio_ready()
for lwb_root_zio and skipped now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15398
This reverts commit aefb6a2bd6.
aefb6a2bd temporally disabled blk-mq until we could fix a fix for
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15439
This fix removes a dubious optimization in zfs_uiomove_bvec_rq()
that saved the iterator contents of a rq_for_each_segment(). This
optimization allowed restoring the "saved state" from a previous
rq_for_each_segment() call on the same uio so that you wouldn't
need to iterate though each bvec on every zfs_uiomove_bvec_rq() call.
However, if the kernel is manipulating the requests/bios/bvecs under
the covers between zfs_uiomove_bvec_rq() calls, then it could result
in corruption from using the "saved state". This optimization
results in an unbootable system after installing an OS on a zvol
with blk-mq enabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15351
... by checking that previous block is fully written and flushed.
It allows to skip commit delays since we can give up on aggregation
in that case. This removes zil_min_commit_timeout parameter, since
for single-threaded workloads it is not needed at all, while on very
fast devices even some multi-threaded workloads may get detected as
single-threaded and still bypass the wait. To give multi-threaded
workloads more aggregation chances increase zfs_commit_timeout_pct
from 5 to 10%, as they should suffer less from additional latency.
Also single-threaded workloads detection allows in perspective better
prediction of the next block size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15381
Once we verified the ABDs and asserted the sizes we should never
see premature ABDs ends. Assert that and remove extra branches
from production builds.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15428
ZFS prefetch is currently governed by the zfs_prefetch_disable
tunable. However, this is a module-wide settings - if a specific
dataset benefits from prefetch, while others have issue with it,
an optimal solution does not exists.
This commit introduce the "prefetch" tri-state property, which enable
granular control (at dataset/volume level) for prefetching.
This patch does not remove the zfs_prefetch_disable, which remains
a system-wide switch for enable/disable prefetch. However, to avoid
duplication, it would be preferable to deprecate and then remove
the module tunable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15237Closes#15436
The first occurrence should be "ARC prefetch data accesses:"
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: ofthesun9 <olivier@ofthesun.net>
Closes#15427
This reverts commit 797f55ef12 which
was causing a VERIFY failure when running the project quota tests.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15438
spa_thread() and the "System Duty Cycle" scheduling class are from
Illumos and have not yet been adapted to Linux or FreeBSD.
HAVE_SPA_THREAD has long been explicitly undefined and used to mark
spa_thread(), but there's some related taskq code that can never be
invoked without it, which makes some already-tricky code harder to read.
HAVE_SYSDC is introduced in this commit to mark the SDC parts. SDC
requires spa_thread(), but the inverse is not true, so they are
separate.
I don't want to make the call to just remove it because I still harbour
hopes that OpenZFS could become a first-class citizen on Illumos
someday. But hopefully this will at least make the reason it exists a
bit clearer for people without long memories and/or an interest in
history.
For those that are interested in the history, the original FreeBSD port
of ZFS (before ZFS-on-Linux was adopted there) did have a spa_thread(),
but not SDC. The last version of that before it was removed can be read
here:
22df1ffd81/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
Meanwhile, more information on the SDC scheduling class is here:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/disp/sysdc.c
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15406
In the zfs_id_over*quota functions, there is a short-circuit to skip
the zap_lookup when the quota zap does not exist. If quotas are never
used in a zpool, then the quota zap will never exist. But if
user/group/project quotas are ever used, the zap objects will be
created and will persist even if the quotas are deleted.
The quota zap_lookup in the write path can become a bottleneck for
write-heavy small I/O workloads. Before this commit, it was not
possible to remove this lookup without creating a new zpool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sam Atkinson <samatk@amazon.com>
Closes#14721
In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should
return identical results, except the second also asserts it deeper.
The first is much cheaper though, saving few pointer dereferences.
Replace production arc_buf_is_shared() calls with ARC_BUF_SHARED(),
and call arc_buf_is_shared() in random assertions, while making it
even more strict.
On my tests this in half reduces arc_buf_destroy_impl() time, that
noticeably reduces hash_lock congestion under heavy dbuf eviction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15397
Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems
due to register size, while on 32-bit due to memory constraints.
And even if we hit some race, the code implementing the delay takes
the lock any way.
Removal of the poll-wide lock acquisition saves ~1% of CPU time on
8-thread 8KB write workload.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15390
Once we trigger the zpool scrub, all zpool/zfs command gets stuck for
180 seconds. Post 180 seconds zpool/zfs commands gets start executing
however few more seconds(10s) it take to update the status. hence
sleeping for 200 seconds so that we get the correct status.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: vaibhav.bhanawat <vaibhav.bhanawat@delphix.com>
Closes#15364
We already use ____cacheline_aligned in many places, so add one more
instead of seems arbitrary char tc_pad[8].
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15402
Reviewed-by: Rob N <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dennis R. Friedrichsen <dennis.r.friedrichsen@gmail.com>
Closes#15417
Variable 'uma_align_cache' has not been used since commit "FreeBSD: Use
a hash table for taskqid lookups" (3933305ea). Moreover, it is soon
going to become private to FreeBSD's UMA in 15.0-CURRENT (main),
14.0-STABLE (stable/14) and 13.2-STABLE (stable/13). Should accessing
this information become necessary again, one will have to use the new
accessors for recent versions.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olivier Certner <olce.freebsd@certner.fr>
Closes#15416
When a vdev is to be expanded -- either via `zpool online -e` or via
the autoexpand option -- a SPA_ASYNC_CONFIG_UPDATE request is queued
to be handled via an asynchronous worker thread (spa_async_thread).
This normally happens almost immediately; but will be delayed up to
zfs_ccw_retry_interval seconds (default 5 minutes) if an attempt to
write the zpool configuration cache failed.
When FreeBSD boots ZFS-root VM images generated using `makefs -t zfs`,
the zpoolupgrade rc.d script runs `zpool upgrade`, which modifies the
pool configuration and triggers an attempt to write to the cache file.
This attempted write fails because the filesystem is still mounted
read-only at this point in the boot process, triggering a 5-minute
cooldown before SPA_ASYNC_CONFIG_UPDATE requests will be handled by
the asynchronous worker thread.
When expanding a vdev, reset the "when did a configuration cache
write last fail" value so that the SPA_ASYNC_CONFIG_UPDATE request
will be handled promptly. A cleaner but more intrusive option would
be to use separate SPA_ASYNC_ flags for "configuration changed" and
"try writing the configuration cache again", but with FreeBSD 14.0
coming very soon I'd prefer to leave such refactoring for a later
date.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colin Percival <cperciva@FreeBSD.org>
Closes#15405
The change is simple -- restore the original code so that the VDEV
path is updated when using by-id paths. The more challenging part
was to devise a second ZTS test, that would test auto-replace for
'by-id' and help prevent a future regression.
With that new test, we can now do an A|B test with , and without,
the fix to confirm that auto-replace for by-id paths works. The
existing auto-replace test, functional/fault/auto_replace_001_pos,
will confirm that we didn't break auto-replace for 'by-vdev' paths.
In the original functional/fault/auto_replace_001_pos test, the disk
wipe (using dd) was not effective in removing the partitioning since
the kernel was never informed of the wipe.
Added a call to wipefs(8) so that the kernel is informed and ZED will
re-partition the device.
Added a validation step that the re-partitioning occurred by
confirming that the GPT partition UUID changes.
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15363
For synchronous write workloads with large IO sizes, a pool configured
with a slog performs worse than one with an embedded zil:
sequential_writes 1m sync ios, 16 threads
Write IOPS: 1292 438 -66.10%
Write Bandwidth: 1323570 448910 -66.08%
Write Latency: 12128400 36330970 3.0x
sequential_writes 1m sync ios, 32 threads
Write IOPS: 1293 430 -66.74%
Write Bandwidth: 1324184 441188 -66.68%
Write Latency: 24486278 74028536 3.0x
The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`,
if a zil block is greater than 768K, the priority of the write is
downgraded from sync to async. Increasing the value allows greater
throughput. To select a value for this PR, I ran an fio workload with
the following values for `zil_slog_bulk`:
zil_slog_bulk KiB/s
1048576 422132
2097152 478935
4194304 533645
8388608 623031
12582912 827158
16777216 1038359
25165824 1142210
33554432 1211472
50331648 1292847
67108864 1308506
100663296 1306821
134217728 1304998
At 64M, the results with a slog are now improved to parity with an
embedded zil:
sequential_writes 1m sync ios, 16 threads
Write IOPS: 438 1288 2.9x
Write Bandwidth: 448910 1319062 2.9x
Write Latency: 36330970 12163408 -66.52%
sequential_writes 1m sync ios, 32 threads
Write IOPS: 430 1290 3.0x
Write Bandwidth: 441188 1321693 3.0x
Write Latency: 74028536 24519698 -66.88%
None of the other tests in the performance suite (run with a zil or
slog) had a significant change, including the random_write_zil tests,
which use multiple datasets.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes#14378
- Group tqent_task and tqent_timeout_task into a union. They are
never used same time. This shrinks taskq_ent_t from 192 to 160 bytes.
- Remove tqent_registered. Use tqent_id != 0 instead.
- Remove tqent_cancelled. Use taskqueue pending counter instead.
- Change tqent_type into uint_t. We don't need to pack it any more.
- Change tqent_rc into uint_t, matching refcount(9).
- Take shared locks in taskq_lookup().
- Call proper taskqueue_drain_timeout() for TIMEOUT_TASK in
taskq_cancel_id() and taskq_wait_id().
- Switch from CK_LIST to regular LIST.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15356
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is
potentially re-enabled before all queued TRIM zios for that metaslab
have completed. Since TRIM zios have the lowest priority, it is
possible to get into a situation where allocations occur from the
just re-enabled metaslab and cut ahead of queued TRIMs to the same
metaslab. If the ranges overlap, this will cause corruption.
We were able to trigger this pretty consistently with a small single
top-level vdev zpool (i.e. small number of metaslabs) with heavy
parallel write activity while performing a manual TRIM against a
somewhat 'slow' device (so TRIMs took a bit of time to complete).
With the patch, we've not been able to recreate it since. It was on
illumos, but inspection of the OpenZFS trim code looks like the
relevant pieces are largely unchanged and so it appears it would be
vulnerable to the same issue.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jking@racktopsystems.com>
Illumos-issue: https://www.illumos.org/issues/15939Closes#15395