1) Make mmap flushes synchronous. Linux may skip flushing dirty pages
already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be
skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.
This change fixes corruption edge cases, but unfortunately adds
synchronous ZIL flushes for dirty mmap pages to llseek and bclone
operations. It may be possible to avoid these sync writes later
but would need more tricky refactoring of the writeback code.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#15933Closes#16019
- Remove custom zap_memset(), use regular memset().
- Use PANIC() instead of opaque cmn_err(CE_PANIC).
- Provide entry parameter to zap_leaf_rehash_entry().
- Reduce branching in zap_leaf_array_create() inner loop.
- Remove signedness where it should not be.
Should be no function changes.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15976
It does not look important how exactly brt_pending_tree is sorted.
When cloning large file, it is quite likely that all of its blocks
have identical physical birth times, so comparing them first does
not provide useful entropy, while accesses additional cache line.
In most cases combination of vdev and offset provides unique result
and physical birth time comparison is not even needed. Meanwhile,
when traversing the tree inside brt_pending_apply(), it can be
beneficial for dbuf cache and CPU cache hits to group processing
by vdev and so by the per-VDEV BRT ZAPs.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15954
Before this change resume token was updated only on data receive.
Usually it is enough to resume replication without much overlap.
But we've got a report of a curios case, where replication source
was traversed with recursive grep, which through enabled atime
modified every object without modifying any data. It produced
several gigabytes of replication traffic without a single data
write and so without a single resume point.
While the resume token was not designed to resume from an object,
I've found that the send implementation always sends object before
any data. So by requesting resume from offset 0 we are effectively
resuming from the object, followed (or not) by the data at offset
0, just as we need it.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15927
Linux 6.8 removes generic_copy_file_range(), which had been reduced to a
simple wrapper around splice_copy_file_range(). Detect that function
directly and use it if generic_ is not available.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#15930Closes#15931
Follow up to 99495ba6ab which
accidentally introduce this regression.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Quartz <yyhran@163.com>
Closes#15907
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Most values in zio_checksum can never be used for dedup, partly because
the dedup= property only offers a limited list, but also some values (eg
ZIO_CHECKSUM_OFF) aren't real and will never be seen.
A true flag would be better than a hardcoded list, but thats more
cleanup elsewhere than I want to do right now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Only a single bit is needed to track entry state, and definitely not two
whole bytes. Some light refactoring in ddt_lookup() is needed to support
this, but it reads a lot better now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Store objects store keys and values, so have them take those types and
nothing more. This way, they don't need to be concerned about the "kind"
of entry being operated on; the dispatch layer can take care of the
appropriate conversions.
This adds a "contains" op to see if a particular entry exists without
loading it, which makes a couple of things easier to do; in particular,
it allows us to avoid an allocation in ddt_class_contains().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
ddt_get_dedup_histogram() was actually checking it, just in an extremely
cursed way. ddt_get_dedup_object_stats() wasn't, but wasn't being called
from a dangerous place so no one noticed.
These checks are necessary, because spa_ddt[] is not populated until
spa_load(), but the spa can exist before that, while being created, and
as vdevs and metaslabs are initialised the space accounting functions
will be called to update pool space counts.
Probably the whole create path doesn't need to go asking for space
accounting from metadata subsystems until after the pool is created.
This will at least catch misuse.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Mostly for consistency, so the reader is less likely to wonder why these
things look different.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Just to make it easier to know which bits to pay attention to.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
It was a weird and confusing name, because it wasn't actually returning
the number of DVAs in the entry (as in, in the value/phys part) but the
maximum number of possible DVAs in a BP generated from the entry, based
on the encrypt bit in the key. This is unlike the similarly named
BP_GET_NDVAS, which really does return the number of DVAs.
Since its only used in this one place, and for a specific purpose, it
seemed more sensible to just write it in-place and remove the name.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
We want to add other kinds of dedup-related objects and keep stats for
them. This makes those functions easier to use from outside ddt.c.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
We're about to have different kinds of things that we'll compare on key,
so generalise this function to support that.
(It actually worked fine because of the way the casts work out, but it
requires the key to be at the start of the object so the cast through
ddt_entry_t works, and even then it reads strangely for anything that's
not a ddt_entry_t).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Always do them on the heap, and when we know how much we need, only that
much.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
I think I can say with some confidence that anyone making a new storage
type in 2023 is doing their own thing with compression, not this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15887
This changes taskq_thread_should_stop() to limit maximum exit rate
for idle threads to one per 5 seconds. I believe the previous one
was broken, not allowing any thread exits for tasks arriving more
than one at a time and so completing while others are running.
Also while there:
- Remove taskq_thread_spawn() calls on task allocation errors.
- Remove extra taskq_thread_should_stop() call.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15873
Similar to deduplication, the size of data duplicated by block cloning
should not be included in the slop space calculation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com>
Closes#15874
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay). However, no action is taken by ZED,
like is done for checksum or I/O errors. This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
VDEV_PROP_SLOW_IO_N
VDEV_PROP_SLOW_IO_T
If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15469
CVE-2020-24370 is a security vulnerability in lua. Although the CVE
description in CVE-2020-24370 said that this CVE only affected lua
5.4.0, according to lua this CVE actually existed since lua 5.2. The
root cause of this CVE is the negation overflow that occurs when you
try to take the negative of 0x80000000. Thus, this CVE also exists in
openzfs. Try to backport the fix to the lua in openzfs since the
original fix is for 5.4 and several functions have been changed.
https://github.com/advisories/GHSA-gfr4-c37g-mm3vhttps://nvd.nist.gov/vuln/detail/CVE-2020-24370https://www.lua.org/bugs.html#5.4.0-11https://github.com/lua/lua/commit/a585eae6e7ada1ca9271607a4f48dfb1786
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ChenHao Lu <18302010006@fudan.edu.cn>
Closes#15847
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error. The range may be for an entire file. While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.
As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.
Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning. This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned. This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller. However, the destination file range is left in an undefined
state.
A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15728Closes#15842
On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue
trim command which is synchronous.
This commit updates vdev_disk_io_trim to use __blkdev_issue_discard,
which is asynchronous. Unfortunately there isn't any asynchronous
version for blkdev_issue_secure_erase, so performance of secure trim
will still suffer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15843
MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just
redefining it, instead define our own name and set it consistently from
the start.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#15805
Linux has removed strlcpy in favour of strscpy. This implements a
fallback implementation of strlcpy for this case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#15805
blkdev_get_by_path() and blkdev_put() have been replaced by
bdev_open_by_path() and bdev_release(), which return a "handle" object
with the bdev object itself inside.
This adds detection for the new functions, and macros to handle the old
and new forms consistently.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#15805
During device removal stress tests, we noticed that we were tripping
the assertion that mg_initialized was true. After investigation, it was
determined that the mg in question was the embedded log metaslab
group for a newly added vdev; the normal mg had been initialized (by
metaslab_sync_reassess, via vdev_sync_done). However, because the spa
config alloc lock is not held as writer across both calls to
metaslab_sync_reassess, it is possible for an allocation to happen
between the two metaslab_groups being initialized. Because the metaslab
code doesn't check the group in question, just the vdev's main mg, it
is possible to get past the initial check in vdev_allocatable and
later fail due to the assertion.
We simply remove the assertions. We could also consider locking the
ALLOC lock around the reassess calls in vdev_sync_done, but that risks
deadlocks. We could check the actual target mg in vdev_allocatable,
but that risks racing with a passivation that comes in after that
check but before the assertion. We still won't be able to actually
allocate from the metaslab group if no metaslabs are ready, so this
change shouldn't break anything.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15818
Removed the list_size struct member as it was only used in a single
assertion, as mentioned in PR #15478.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <imerimigel@gmail.com>
Closes#15812
If devid or physpath for a vdev changes between imports, ensure it is
updated to the new value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15816
Descriptor leak can be easily reproduced by doing:
# zpool import tank
# sysctl kern.openfiles
# zpool export tank; zpool import tank
# sysctl kern.openfiles
We were leaking four file descriptors on every import.
Similar leak most likely existed when using file-based VDEVs.
External-issue: https://reviews.freebsd.org/D43529
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#15630
Switch from cv_wait() to cv_wait_idle() in vdev_autotrim_wait_kick(),
which should mitigate the high load average while waiting.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes#15781
If the destination file is mmaped and the mmaped region was already
read, so it is cached, we need to update mmaped pages after successful
clone using update_pages().
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Pointed out by: Ka Ho Ng <khng@freebsd.org>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#15772
In db4fc559c I messed up and changed this bit of code to set the inode
atime to an uninitialised value, when actually it was just supposed to
loading the atime from the inode to be stored in the SA. This changes it
to what it should have been.
Ensure times change by the right amount Previously, we only checked
if the times changed at all, which missed a bug where the atime was
being set to an undefined value.
Now ensure the times change by two seconds (or thereabouts), ensuring
we catch cases where we set the time to something bonkers
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/Closes#15762Closes#15773
On Linux x86_64, kmem cache can have size up to 4M,
however increasing spl_kmem_cache_slab_limit can lead
to crash due to the size check inconsistency.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#15757
Pool import logic uses vdev paths, so it makes sense to add path
information on AUX vdev as well.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15737
When spare or l2cache (aux) vdev is added during pool creation,
spa->spa_uberblock is not dumped until that point. Subsequently,
the aux label is never synchronized after its initial creation,
resulting in the uberblock label remaining undumped. The uberblock
is crucial for lib_blkid in identifying the ZFS partition type. To
address this issue, we now ensure sync of the uberblock label once
if it's not dumped initially.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15737
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15719
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl(). In particular, this code triggers an assertion
failure in sbuf_clear().
Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.
Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15719
In general, VOPs must not load the "z_log" field until having called
zfs_enter_verify_zp().
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15752
We need to wait until after having done a zfs_enter() to load some
fields from the zfsvfs structure. Otherwise a use-after-free is
possible in the face of a concurrent rollback.
Other functions in this file are careful to avoid this bug, I believe
this is the only instance.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#15752
Two block pointers in livelist pointing to the same location may
be caused not only by dedup, but also by block cloning. We should
not assert D bit set in them.
Two block pointers in livelist pointing to the same location may
have different logical birth time in case of dedup or cloning. We
should assert identical physical birth time instead.
Assert identical physical block size between pointers in addition
to checksum, since that is what checksums are calculated on.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15732
- Fail if source block is smaller than destination. We can only
grow blocks, not shrink them.
- Fail if we do not have full znode range lock. In that case grow
is not even called. We should improve zfs_rangelock_cb() somehow
to know when cloning needs to grow the block size unlike write.
- Fail of we tried to resize, but failed. There are many reasons
for it to fail that we can not predict at this level, so be ready
for them. Unlike write, that may proceed after growth failure,
block cloning can't and must return error.
This fixes assertion inside dmu_brt_clone() when it sees different
number of blocks held in destination than it got block pointers.
Builds without ZFS_DEBUG returned EXDEV, so are not affected much.
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15724Closes#15735
On some systems we already have blkdev_get_by_path() with 4 args
but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined.
The vdev_bdev_mode() function was added to handle this case
but there was no generic way to specify exclusive access.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15692
While 763ca47 closes the situation of block cloning creating
unencrypted records in encrypted datasets, existing data still causes
panic on read. Setting zfs_recover bypasses this but at the cost of
potentially ignoring more serious issues.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Peredun <chris.peredun@ixsystems.com>
Closes#15677
Track history in context of bursts, not individual log blocks. It
allows to not blow away all the history by single large burst of
many block, and same time allows optimizations covering multiple
blocks in a burst and even predicted following burst. For each
burst account its optimal block size and minimal first block size.
Use that statistics from the last 8 bursts to predict first block
size of the next burst.
Remove predefined set of block sizes. Allocate any size we see fit,
multiple of 4KB, as required by ZIL now. With compression enabled
by default, ZFS already writes pretty random block sizes, so this
should not surprise space allocator any more.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15635
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.
This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.
This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.
Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15675
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accomodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robnCloses#15681