Multiple changes to the blkdev API were introduced in Linux 6.5. This
includes passing (void* holder) to blkdev_put, adding a new
blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type
that replaces uses of fmode_t, and removing an argument from the release
handler on block_device_operations that we weren't using. The open
function definition has also changed to take gendisk* and blk_mode_t, so
update it accordingly, too.
Implement local wrappers for blkdev_get_by_path() and
vdev_blkdev_put() so that the in-line calls are cleaner, and place the
conditionally-compiled implementation details inside of both of these
local wrappers. Both calls are exclusively used within vdev_disk.c, at
this time.
Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE
The wrapper function is now used for testing using the appropriate
method for the kernel, whether the open mode is writable or not.
Emphasize fmode_t arg in zvol_release is not used
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15099
When disk_check_media_change() exists, then define
zfs_check_media_change() to simply call disk_check_media_change() on
the bd_disk member of its argument. Since disk_check_media_change()
is newer than when revalidate_disk was present in bops, we should
be able to safely do this via a macro, instead of recreating a new
implementation of the inline function that forces revalidation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15101
Fastwrite was introduced many years ago to improve ZIL writes spread
between multiple top-level vdevs by tracking number of allocated but
not written blocks and choosing vdev with smaller count. It suposed
to reduce ZIL knowledge about allocation, but actually made ZIL to
even more actively report allocation code about the allocations,
complicating both ZIL and metaslabs code.
On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync()
was lost many years ago, that was one of the declared benefits. Plus
introduction of embedded log metaslab class solved another problem
with allocation rotor accounting both normal and log allocations,
since in most cases those are now in different metaslab classes.
After all that, I'd prefer to simplify already too complicated ZIL,
ZIO and metaslab code if the benefit of complexity is not obvious.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15107
Redhat have backported copy_file_range and clone_file_range to the EL7
kernel using an "extended file operations" wrapper structure. This
connects all that up to let cloning work there too.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and
were implemented as regular filesystem-specific ioctls. This implements
those ioctls directly in OpenZFS, allowing cloning to work on older
kernels.
There's no need to gate these behind version checks; on later kernels
Linux will simply never deliver these ioctls, instead calling the
approprate VFS op.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
This implements the Linux VFS ops required to service the file
copy/clone APIs:
.copy_file_range (4.5+)
.clone_file_range (4.5-4.19)
.dedupe_file_range (4.5-4.19)
.remap_file_range (4.20+)
Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are
hooked up here, but are not implemented yet.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes#15050
- Reduce maximum prefetch distance for 32bit platforms to 8MB as it
was previously. Those systems didn't grow much probably, so better
stay conservative there.
- Retire array_rd_sz tunable, blocking prefetch for large requests.
We should not penalize applications trying to be more efficient. The
speculative prefetcher by itself has reasonable distance limits, and
1MB is not much at all these days.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15072
To simplify error handling bpobj_iterate_blkptrs() iterates through
the list of block pointers backwards. Unfortunately speculative
prefetcher is currently unable to detect such patterns, that makes
each block read there synchronous and very slow on HDD pools.
According to my tests, added explicit prefetch reduces time needed
to asynchronously delete 8 snapshots of 4 million blocks each from
20 seconds to less than one, that should free sync thread for other
useful work, such as async writes, scrub, etc.
While there, plug one memory leak in case of bpobj_open() error and
harmonize some variable names.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15071
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely. So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match". The harm in sending them is simply that they bloat the
event. In particular, on FreeBSD the event must fit into a 1016 byte
buffer.
Fixes#14717 for mirrored pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#14717Closes#15052
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete. With modern storage hardware, they will
almost always look like white noise; all bits will be wrong. They only
serve to bloat the event. That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.
This fixes issue #14717 for RAIDZ pools, but not for mirror pools.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes#15052
Since spa_min_alloc may not be a power of 2, unlike ashifts, in the
case of DRAID, we should not select the minimal value among several
vdevs. Rounding to a multiple of it is unlikely to work for other
vdevs. Instead, using the greatest common divisor produces smaller
yet more reasonable results.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15067
Ashift can be set for a vdev only during its creation, and the
top-level vdev does not change when a vdev is attached or replaced.
The ashift property should not be used during attachment, as it
does not allow attaching/replacing a vdev if the pool's ashift
property is increased after the existing vdev was created. Instead,
we should be able to attach the vdev if the attached vdev can
satisfy the ashift requirement with its parent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15061
The disk_check_media_change() function was added which replaces
bdev_check_media_change. This change was introduced in 6.5rc1
444aa2c58cb3b6cfe3b7cc7db6c294d73393a894 and the new function takes a
gendisk* as its argument, no longer a block_device*. Thus, bdev->bd_disk
is now used to pass the expected data.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15060
As it turns out having autotrim default to 'on' on FreeBSD never really
worked due to mess with defines where userland and kernel module were
getting different default values (userland was defaulting to 'off',
module was thinking it's 'on').
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15079
This change was introduced in Linux commit
7ba150834b840f6f5cdd07ca69a4ccf39df59a66
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15059
Make the version here match that elsewhere in the kernel and system
headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#15058
- Pack struct zio_prop by 4 bytes from 84 to 80.
- Skip new child ZIO locking while linking to parent. The newly
allocated ZIO is not externally visible yet, so nobody should care.
- Skip io_bp_copy writes when not used (write && non-debug).
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14985
Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O. This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed. While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.
This should partially address the same issue as c85ac731a0, but
from a different perspective, complementing it.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15007
Consistently get the proper default value for autotrim.
Currently, only the kernel module is built with IN_FREEBSD_BASE,
and libzfs get the wrong default value, leading to confusion and
incorrect output when autotrim value was not set explicitly.
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes#15016
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists. AVL-trees are too expensive
for such a simple task. To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.
To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue. Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member. Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().
Introduce deadline of ~0.5s for LBA-sorted queues. Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation. With this change I no longer see it. I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal. For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.
Arrange members in struct zio to access only one cache line when
searching through vdev queues. While there, remove io_alloc_node,
reusing the io_queue_node instead. Those two are never used same
time.
Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests. Just remove the tree.
Remove locking from txg_all_lists_empty(). It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.
With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14925
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...
I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...
So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14938
Those callbacks were introduced many years ago as part of a bigger
patch to smoothen the write throttling within a txg. They allow to
account completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.
Few years after that ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.
While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it. Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second. Same time attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, remove also io_child_count/io_parent_count counters.
They are used only for couple assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14948
With large number of tracked references list searches under the lock
become too expensive, creating enormous lock contention.
On my tests with ZFS_DEBUG enabled this increases write throughput
with 32KB blocks from ~1.2GB/s to ~7.5GB/s.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14970
It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result. But since it has almost no idea about what is going
on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers,
it was found to make more harm than good, for which reason it was
disabled for the past 12 years. These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.
Besides just the dead code removal this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), not otherwise
disabled and trying to do some work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14953
There's no particular reason this function should be kernel-only, and I
want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
because libzpool does not compile in zfs_vfsops.c, and this at least
matches the header its imported from.
Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#14642
There are two places where we need to add/remove several references
with semantics of zfs_refcount_(add|remove). But when debug/tracing
is disabled, it is a crime to run multiple atomic_inc() in a loop,
especially under congested pool-wide allocator lock.
Introduced new functions implement the same semantics as the loop,
but without overhead in production builds.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14934
This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improve with Clang and 8%
improvement with GCC.
Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.
Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14866
In addition to a number of actual log bytes written, account also a
total written bytes including padding and total allocated bytes (bytes
<= write <= alloc). It should allow to monitor zil traffic and space
efficiency.
Add dtrace probe for zil block size selection.
Make zilstat report more information and fit it into less width.
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14863
Before this change ZIL copied all log data while holding the lock.
It caused huge lock contention on workloads with many big parallel
writes. This change splits the process into two parts: first,
zil_lwb_assign() estimates the log space needed for all transactions,
and zil_lwb_write_close() allocates blocks and zios while holding the
lock, then, after the lock in dropped, zil_lwb_commit() copies the
data, and zil_lwb_write_issue() issues the I/Os.
Also while there slightly reduce scope of zl_lock.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14841
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#8995Closes#12355
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.
So let's add zpool initialize -u to clear it.
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12451Closes#14873
The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
1. txg is 0 for unassigned tx, so it was a no-op.
2. It is equivalent of checking if we have dirty records and we are doing
this few lines earlier.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#14825
I don't know an easy way to shrink down dbuf size, so just deny block cloning
into dbufs that don't match our BP's size.
This fixes the following situation:
1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
3. Truncate the large file to 0. Its dbuf will remain 2kB.
4. Clone the small file into the large file. Small file's BP lsize is
1kB, but the large file's dbuf is 2kB.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#14825
Reimplement some of the block cloning vs dbuf logic, mostly to fix
situation where we clone a block and in the same transaction group
we want to partially overwrite the clone.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#14825
Provides an interface which callers can use to declare a write when
the exact starting offset in not yet known. Since the full range
being updated is not available only the first L0 block at the
provided offset will be prefetched.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14819
In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.
Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.
To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.
Before this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: Permanent errors have been detected in the following files:
=============================
After this change:
=============================
pool: test
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
/home/user/vdev_a ONLINE 0 0 2
errors: No known data errors
=============================
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14813
If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.
However, such corruption is not typically recoverable. To recover from
it we would need to detect the memory error before the block pointer is
written to disk.
This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.
To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.
Additionally, when corruption is detected, the raw words of the block
pointer are logged. (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#14817
Clang specific pragmas need to be wrapped to prevent a build
warning when compiling with gcc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14814
FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian. The
existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE
is ELFv2. Minor changes to add ELFv2 in the BE side gets this working
correctly on FreeBSD with latest OpenZFS import.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Justin Hibbits <chmeeedalf@gmail.com>
Closes#14779
This reverts commit 4c856fb333 to
resolve a newly introduced deadlock which in practice in more
disruptive that the issue this commit intended to address.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775Closes#14790
Add loongarch64 definitions & lua module setjmp asm
LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Han Gao <gaohan@uniontech.com>
Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes#13422
Usage:
zpool set org.freebsd:comment="this is my pool" poolname
Tests are based on zfs_set's user property tests.
Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes#11680
Clang points out that there is a comparison against -1, but we cannot
fix it because that is from the kernel headers, which we must support.
We can workaround this by using a pragma.
Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes#14738
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14405
At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Herb Wartens <hawartens@gmail.com>
Closes#14659
Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if resilver does not schedule, the spare vdev keeps attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14722
Add a new union member of flexible array to dnode_phys_t and use
it in the macro so we can silence the memcpy() fortify error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14737
Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap
type for mounts (struct mnt_idmap), we need to detect these changes
and make zfs work with the new APIs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14682
Just like x86, aarch64 needs to use the fpu_kern(9) API around FPU
usage, otherwise we panic promptly at boot as soon as ZFS attempts to
do checksum benchmarking.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#14715
f6a0dac84 modified the zfs_iter_* functions to take a new "flags"
parameter, and introduced a variety of flags to ask the kernel to limit
the results in various ways, reducing the amount of work the caller
needed to do to filter out things they didn't need.
Unfortunately this change broke the ABI for existing clients (read:
older versions of the `zfs` program), and was reverted 399b98198.
dc95911d2 reintroduced the original patch, with the understanding that a
backwards-compatible fix would be made before the 2.2 release branch was
tagged. This commit is that fix.
This introduces zfs_iter_*_v2 functions that have the new flags
argument, and reverts the existing functions to not have the flags
parameter, as they were before. The old functions are now reimplemented
in terms of the new, with flags set to 0.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Original-patch-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Closes#14597
Add missing machine/md_var.h to spl/sys/simd_aarch64.h and
spl/sys/simd_arm.h
In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE
on i386
In FreeBSD sys/elf_common.h redefines AT_UID and AT_GID on FreeBSD, we need
a hack in vnode.h similar to Linux. sys/simd.h needs to be included early.
In zfs_freebsd_copy_file_range() we pass a (size_t *)lenp to
zfs_clone_range() that expects a (uint64_t *)
Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and
sha512_impl.c to __ARM_ARCH > 6
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Signed-off-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#14674
Running `zfs list -o avail rpool` resulted in a core dump.
This commit will fix this.
Run the needed overhead only, when `use_color()` is true.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#14712
Modify bio_set_flush() so if kernel version is >= 4.10, flags
REQ_PREFLUSH and REQ_OP_WRITE are set together.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14695
Address the following bugs in persistent error log:
1) Check nested clones, eg "fs->snap->clone->snap2->clone2".
2) When deleting files containing error blocks in those clones (from
"clone" the example above), do not break the check chain.
3) When deleting files in the originating fs before syncing the errlog
to disk, do not break the check chain. This happens because at the
time of introducing the error block in the error list, we do not have
its birth txg and the head filesystem. If the original file is
deleted before the error list is synced to the error log (which is
when we actually lookup the birth txg and the head filesystem), then
we do not have access to this info anymore and break the check chain.
The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).
We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.
The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64
We also introduced two new tests covering the above changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14633
Current autotrim causes short-lived txg through:
1. calling txg_wait_synced() in metaslab_enable()
2. calling txg_wait_open() with should_quiesce = true
This patch addresses all the issues mentioned above.
A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity.
It will be signaled once a txg is synced so that it does not change
the original autotrim pace. Also because it is a cv, the wait is
interruptible which speeds up the vdev_autotrim_stop_wait() call.
Finally, combining big zfs_txg_timeout, txg_wait_open() also causes
delay when exporting a pool.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Issue #8993Closes#12194
Linux 6.3+, and backports from it (6.2.8+), changed the
signatures on bdev_io_{start,end}_acct. Add a case for it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14658Closes#14668
This is probably the uncontroversial part of #13631, which fixes
a real problem people are having.
There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.
Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#14649
There is a window in the slog removal code where a panic loop could
ensue if the system crashes during that operation. The original design
of slog removal did not persisted any state because the removal happened
synchronously. This was changed by a later commit which persisted the
vdev_removing flag and exposed this bug. If a slog removal is in
progress and happens to crash after persisting the vdev_removing flag to
the label but before the vdev is removed from the spa config, then the
pool will continue to panic on import. Here's a sample of the panic:
[ 134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size,
FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22)
[ 134.393865] PANIC at dmu.c:1135:dmu_write()
[ 134.396035] Kernel panic - not syncing: VERIFY0(0 ==
dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs,
&dbp)) failed (0 == 22)
[ 134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted:
P OE 5.4.0-1100-dx2023020205-b3751f8c2-azure #106
[ 134.407938] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 134.407938] Call Trace:
[ 134.407938] dump_stack+0x57/0x6d
[ 134.407938] panic+0xfb/0x2d7
[ 134.407938] spl_panic+0xcf/0x102 [spl]
[ 134.407938] ? traverse_impl+0x1ca/0x420 [zfs]
[ 134.407938] ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs]
[ 134.407938] ? dnode_hold+0x1b/0x20 [zfs]
[ 134.407938] dmu_write+0xc3/0xd0 [zfs]
[ 134.407938] ? space_map_alloc+0x55/0x80 [zfs]
[ 134.407938] metaslab_sync+0x61a/0x830 [zfs]
[ 134.407938] ? queued_spin_unlock+0x9/0x10 [zfs]
[ 134.407938] vdev_sync+0x72/0x190 [zfs]
[ 134.407938] spa_sync_iterate_to_convergence+0x160/0x250 [zfs]
[ 134.407938] spa_sync+0x2f7/0x670 [zfs]
[ 134.407938] txg_sync_thread+0x22d/0x2d0 [zfs]
[ 134.407938] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 134.407938] thread_generic_wrapper+0x83/0xa0 [spl]
[ 134.407938] kthread+0x104/0x140
[ 134.407938] ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[ 134.407938] ? kthread_park+0x90/0x90
[ 134.457802] ret_from_fork+0x1f/0x40
This change no longer persists the vdev_removing flag when removing slog
devices and also cleans up some code that was added which is not used.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes#14652
Undirty the dbuf and destroy its buffer when cloning into it.
Coverity ID: CID-1535375
Reported-by: Richard Yao
Reported-by: Benjamin Coddington
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#14655
Commit 11913870 (#14567) added cmn_err_once() by #define'ing a
compound statement but failed to consider usage in a single
statement brace-less if else.
Fix the problem by using the common "do {} while (0)" construct.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#14629
Commit 5401472 adds a check to call enable_kernel_spe and
disable_kernel_spe only if CONFIG_SPE is defined. Refactor this check
in a way similar to what CONFIG_ALTIVEC and CONFIG_VSX are checked, in
order to remove redundant kfpu_begin() and kfpu_end() implementations.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes#14623
This commit removes the edonr_byteorder.h file and all unused
variants of Edon-R.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13618
Coverity reported possible out-of-bounds reads from doing `((char
*)(nvp) + sizeof (nvpair_t))` to get the nvpair name string. These were
initially marked as false positives, but since we are now using C99
flexible array members elsewhere, we could use them here too as cleanup
to make the code easier to understand.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reported-by: Coverity (CID-977165)
Reported-by: Coverity (CID-1524109)
Reported-by: Coverity (CID-1524642)
Closes#14612
After addressing coverity complaints involving `nvpair_name()`, the
compiler started complaining about dropping const. This lead to a rabbit
hole where not only `nvpair_name()` needed to be constified, but also
`nvpair_value_string()`, `fnvpair_value_string()` and a few other static
functions, plus variable pointers throughout the code. The result became
a fairly big change, so it has been split out into its own patch.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14612
Clang's static analyzer complains that nvs_xdr() and nvs_native()
functions return pointers to stack memory. That is technically true, but
the pointers are stored in stack memory from the caller's stack frame,
are not read by the caller and are deallocated when the caller returns,
so this is harmless. We set the pointers to NULL to silence the
warnings.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14612
The commit replaces all findings of the link:
http://www.opensolaris.org/os/licensing with this one:
https://opensource.org/licenses/CDDL-1.0
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#14625
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in #14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.
To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#14600
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#13392
Linux since 4.7 makes interface 'cpu_has_feature' to use jump labels on
powerpc if CONFIG_JUMP_LABEL_FEATURE_CHECKS is enabled, in this case
however the inline function references GPL-only symbol
'cpu_feature_keys'.
ZFS currently uses 'cpu_has_feature' either directly or indirectly from
several places; while it is unknown how this issue didn't break ZFS on
64-bit little-endian powerpc, it is known to break ZFS with many Linux
versions on both 32-bit and 64-bit big-endian powerpc.
Until this issue is fixed in Linux, we have to workaround it by
overriding affected inline functions without depending on
'cpu_feature_keys'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes#14590
Clang's static analyzer pointed out that if we can pass a -1 array index
to copyname[copies] if there are no valid DVAs. This is an absurd
situation, but it suggests that we are missing an assertion, so we add
it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14575
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14359
This fixes building ZFS for Linux 4.7+ powerpc* architecture, where
Linux was configured without CONFIG_ALTIVEC or CONFIG_VSX.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes#14591
The intent is that this is like ENOTSUP, but specifically for when
something can't be done because we have no support for the requested
crypto parameters; eg unlocking a dataset or receiving a stream
encrypted with a suite we don't support.
Its not intended to be recoverable without upgrading ZFS itself.
If the request could be made to work by enabling a feature or modifying
some other configuration item, then some other code should be used.
load-key: In the future we might have more crypto suites (ie new values
for the `encryption` property. Right now trying to load a key on such
a future crypto suite will look up suite parameters off the end of the
crypto table, resulting in misbehaviour and/or crashes (or, with debug
enabled, trip the assertion in `zio_crypt_key_unwrap`).
Instead, lets check the value we got from the dataset, and if we can't
handle it, abort early.
recv: When receiving a raw stream encrypted with an unknown crypto
suite, `zfs recv` would report a generic `invalid backup stream`
(EINVAL). While technically correct, its not super helpful, so lets
ship a more specific error code and message.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#14577
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#14567
This commit changes the BLAKE3 implementation handling and
also the calls to it from the ztest command.
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13741
The skeleton file module/icp/include/generic_impl.c can be used for
iterating over different implementations of algorithms.
It is used by SHA256, SHA512 and BLAKE3 currently.
The Solaris SHA2 implementation got replaced with a version which is
based on public domain code of cppcrypto v0.10.
These assembly files are taken from current openssl master:
- sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
- sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
- sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
- sha512-armv7.S: ARMv7, NEON (arm)
- sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
- sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
- sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
- sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
- sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13741
These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-Authored-By: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes#13741
These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64
Changes:
- simd_powerpc.h: change license from CDDL to BSD
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13741
We had three sha2.h headers in different places.
The FreeBSD version, the Linux version and the generic solaris version.
The only assembly used for acceleration was some old x86-64 openssl
implementation for sha256 within the icp module.
For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS,
these files got removed also.
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13741
With some pathological access patterns it is possible to make ZFS
accumulate almost unlimited amount of speculative prefetch ZIOs.
Combined with linear ABD allocations in RAIDZ code, it appears to
be possible to exhaust system KVA, triggering kernel panic.
Address this by introducing a system-wide counter of active prefetch
requests and blocking prefetch distance doubling per stream hits if
the number of active requests is higher that ~6% of ARC size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14516
openzfsonwindows/openzfs#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.
On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL. However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (#11097), so we can safely
disregard that issue for now.
Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14514
A spurious mutex_exit() in a development branch caused weird issues
until I identified it. An assertion prior to mutex_exit() would have
caught it. Rather than adding assertions before invocations of
mutex_exit() in the code, let us simply add an assertion to
mutex_exit(). It is cheap and will likely improve developer
productivity.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes#14541
Commit 34ce4c4 made zfeature_active() non-static. This is not required.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14546
Hole detection in the zio compression code allows us to
opportunistically skip compression on holes. We can go a step further
by not doing memory allocations on holes either.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes#14500
In commit 0a5b942d4 the FreeBSD SECTION_STATIC macro was set to
".rodata". This assembler directive is supported by LLVM (as a
convenience alias for ".section .rodata") by not by GNU as.
This caused the FreeBSD builds that are done with gcc to fail.
Therefore, use ".section .rodata" instead, similar to the other
asm_linkage.h headers.
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes#14526
When a page is faulted in for memory mapped I/O the page lock
may be dropped before it has been read and marked up to date.
If a buffered read encounters such a page in mappedread() it
must wait until the page has been updated. Failure to do so
will result in a panic on debug builds and incorrect data on
production builds.
The critical part of this change is in mappedread() where pages
which are not up to date are now handled. Additionally, it
includes the following simplifications.
- zfs_getpage() and zfs_fillpage() could be passed an array of
pages. This could be more efficient if it was used but in
practice only a single page was ever provided. These
interfaces were simplified to acknowledge that.
- update_pages() was modified to correctly set the PG_error bit
on a page when it cannot be read by dmu_read().
- Setting PG_error and PG_uptodate was moved to zfs_fillpage()
from zpl_readpage_common(). This is consistent with the
handling in update_pages() and mappedread().
- Minor additional refactoring to comments and variable
declarations to improve readability.
- Add a test case to exercise concurrent buffered, direct,
and mmap IO to the same file.
- Reduce the mmap_sync test case default run time.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13608Closes#14498
Apply zero padding for checksums consistently. The SNPRINTF_BLKPTR
macro was not updated in commit ac7648179c which results in the
`cli_root/zdb/zdb_checksum.ksh` test case reliably failing.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14497
As of the 4.13 kernel filemap_range_has_page() can be used to
check if there is a page mapped in a given file range. When
available this interface should be used which eliminates the
need for the zp->z_is_mapped boolean.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14493
The zio returned from arc_write() in dmu_objset_sync() uses
zio_nowait(). However we may reach the end of dsl_dataset_sync()
which checks if we need to activate features in the filesystem
without knowing if that zio has even run through the ZIO pipeline yet.
In that case we will flag features to be activated in
dsl_dataset_block_born() but dsl_dataset_sync() has already
completed its run and those features will not actually be activated.
Mitigate this by moving the feature activation code in
dsl_dataset_sync_done(). Also add new ASSERTs in
dsl_scan_visitbp() checking if a block contradicts any filesystem
flags.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#13816
In https://github.com/openzfs/zfs/pull/14228 the FreeBSD
SECTION_STATIC was set to ".data" instead of ".rodata". This
commit just restores it back to .rodata.
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#14460
zfs_setproctitle_init() is stubbed out on FreeBSD.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes#14441
During snapshot deletion ZFS may issue several reads for each deadlist
to merge them into next snapshot's or pool's bpobj. Number of the dead
lists increases with number of snapshots. On HDD pools it may take
significant time during which sync thread is blocked.
This patch introduces prescient prefetch of required blocks for up to
128 deadlists ahead. Tests show reduction of time required to delete
dataset with 720 snapshots with randomly overwritten file on wide HDD
pool from 75-85 to 22-28 seconds.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Issue #14276Closes#14402
Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.
Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#14415
The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.
Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#14422
Introduce four new vdev properties:
checksum_n
checksum_t
io_n
io_t
These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.
When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.
Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.
The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Closes#13805
This is only ever used with unsigned data, so the type itself should be
unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly
identified the following assertion as always being true, so we can drop
it:
ASSERT3U(dd->dd_space_towrite[i & TXG_MASK], >=, 0);
The reason it was always true is because it would do casts to give us
unsigned comparisons. This could have been fixed by switching to
`ASSERT3S()`, but upon inspection, it turned out that this variable
never should have been allowed to be signed in the first place.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14408
In original code, zfs_znode_dmu_fini is called in zfs_rmnode without
zfs_znode_hold_enter. It seems to assume it's ok to do so when the znode
is unlinked. However this assumption is not correct, as zfs_zget can be
called by NFS through zpl_fh_to_dentry as pointed out by Christian in
https://github.com/openzfs/zfs/pull/12767, which could result in a
use-after-free bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#12767Closes#14364
Add new macro ASMABI used by Windows to change
calling API to "sysv_abi".
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#14228
This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.
Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14376
I recently gained the ability to run Clang's static analyzer on the
linux kernel modules via a few hacks. This extended coverage to code
that was previously missed since Clang's static analyzer only looked at
code that we built in userspace. Running it against the Linux kernel
modules built from my local branch produced a total of 72 reports
against my local branch. Of those, 50 were reports of logic errors and
22 were reports of dead code. Since we already had cleaned up all of
the previous dead code reports, I felt it would be a good next step to
clean up these dead code reports. Clang did a further breakdown of the
dead code reports into:
Dead assignment 15
Dead increment 2
Dead nested assignment 5
The benefit of cleaning these up, especially in the case of dead nested
assignment, is that they can expose places where our error handling is
incorrect. A number of them were fairly straight forward. However
several were not:
In vdev_disk_physio_completion(), not only were we not using the return
value from the static function vdev_disk_dio_put(), but nothing used it,
so I changed it to return void and removed the existing (void) cast in
the other area where we call it in addition to no longer storing it to a
stack value.
In FSE_createDTable(), the function is dead code. Its helper function
FSE_freeDTable() is also dead code, as are the CPP definitions in
`module/zstd/include/zstd_compat_wrapper.h`. We just delete it all.
In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig()
returns 0 if there are waiting signals and 1 if there are none. The
Linux SPL version literally returns `signal_pending(current) ? 0 : 1)`
and FreeBSD implements the same semantics, we can just do
`!cv_wait_sig()` in place of `signal_pending(current)` to avoid
unnecessarily calling it again.
zfs_setattr() on FreeBSD version did not have error handling issue
because the code was removed entirely from FreeBSD version. The error is
from updating the attribute directory's files. After some thought, I
decided to propapage errors on it to userspace.
In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the
first check in favor of checking three other permissions. I assume this
is intentional.
In zfs_create_fs(), the return value of zap_update() was not checked
despite setting an important version number. I see no backward
compatibility reason to permit failures, so we add an assertion to catch
failures. Interestingly, Linux is still using ASSERT(error == 0) from
OpenSolaris while FreeBSD has switched to the improved ASSERT0(error)
from illumos, although illumos has yet to adopt it here. ASSERT(error ==
0) was used on Linux while ASSERT0(error) was used on FreeBSD since the
entire file needs conversion and that should be the subject of
another patch.
dnode_move()'s issue was caused by us not having implemented
POINTER_IS_VALID() on Linux. We have a stub in
`include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be
in `include/os/linux/spl/sys/kmem.h` to be consistent with
Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and
`POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we
copy what it did.
Whenever a report was in platform-specific code, I checked the FreeBSD
version to see if it also applied to FreeBSD, but it was only relevant a
few times.
Lastly, the patch that enabled Clang's static analyzer to be run on the
Linux kernel modules needs more work before it can be put into a PR. I
plan to do that in the future as part of the on-going static analysis
work that I am doing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14380
d_alias may need to be converted to du.d_alias
depending on the kernel version.
d_alias is currently in only one place in the code which
changes
"hlist_for_each_entry(dentry, &inode->i_dentry, d_alias)"
to
"hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)"
as neccesary.
This effectively results in a double macro expansion
for code that uses the zfs headers but already has its
own macro for just d_alias (lustre in this case).
Remove the conditional code for hlist_for_each_entry
and have a macro for "d_alias -> du.d_alias" instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Closes#14377
The Blocking Queue (bqueue) code is used by zfs send/receive to send
messages between the various threads. It uses a shared linked list,
which is locked whenever we enqueue or dequeue. For workloads which
process many blocks per second, the locking on the shared list can be
quite expensive.
This commit changes the bqueue logic to have 3 linked lists:
1. An enquing list, which is used only by the (single) enquing thread,
and thus needs no locks.
2. A shared list, with an associated lock.
3. A dequing list, which is used only by the (single) dequing thread,
and thus needs no locks.
The entire enquing list can be moved to the shared list in constant
time, and the entire shared list can be moved to the dequing list in
constant time. These operations only happen when the `fill_fraction` is
reached, or on an explicit flush request. Therefore, the lock only
needs to be acquired infrequently.
The API already allows for dequing to block until an explicit flush, so
callers don't need to be changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#14121
This commit supports for spare vdev hotplug. The
spare vdev associated with all the pools will be
marked as "Removed" when the drive is physically
detached and will become "Available" when the
drive is reattached. Currently, the spare vdev
status does not change on the drive removal and
the same is the case with reattachment.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14295
Every ARC buffer holds a reference on the header. It means headers with
buffers are never evictable. When we are evicting a header, there can
be no more buffers to free. Just assert that.
b_evict_lock seems not protecting anything now. Remove it.
Buffers checksum should also be freed with the last uncompressed buffer,
so it should not be there also when we are evicting the header.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
The bi_rw member of struct bio was renamed to bi_opf in Linux 6.2.
As well, Linux's implementation of bio_set_op_attrs(...) has been
removed.
The HAVE_BIO_BI_OPF macro already appears to be defined, but the
removal of the bio_set_op_attrs(...) implementation makes the build
fall back on the locally-defined implementation, which isn't updated
for the bio->bi_opf change. This commit adds that update.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#14324Closes#14331
Linux 6.2 renamed the get_acl() operation to get_inode_acl() in
the inode_operations struct. This should fix Issue #14323.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#14323Closes#14331
On FreeBSD this reduces this structure size from 64 to 56 bytes.
dnode_handle_t respectively reduces from 72 to 64 bytes. It sounds
like a waste to need 72 bytes to be able to relocate 808 bytes of
dnode_t, which relocation on FreeBSD is not even supported.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14317
This saves 40 bytes per full ARC header, reducing it on FreeBSD from
240 to 200 bytes on production bits.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#14315
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14243
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.
- Decouple prefetch status tracking from b_refcnt. It made sense
originally, but became highly cryptic over the years. Move all the
logic into arc_access(). While there, clean up and comment state
transitions in arc_access(). Some transitions were weird IMO.
- Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done(). To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
- Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
- Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
- Remove hash_lock argument from functions not needing it.
- Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
- Fix few false positive tests found in the process.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14123
There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:
A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.
A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).
Note that this was introduced by #12812.
This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.
Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.
Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#14239Closes#14289
This adds support to color zfs diff (in the style of git diff)
conditional on the ZFS_COLOR environment variable.
Signed-off-by: Ethan Coe-Renner <coerenner1@llnl.gov>
We currently compute a 64-bit hash three times, which consumes 0.8% CPU
time on ARC eviction heavy workloads. Caching the 64-bit value in the
dbuf allows us to avoid that overhead.
Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes#14251
If the fields to be listed and sorted by are constrained to those
populated by dsl_dataset_fast_stat(), then zfs list is much faster,
as it does not need to open each objset and reads its properties.
A previous optimization by Pawel Dawidek
(0cee24064a) took advantage
of this to make listing snapshot names sorted only by name much faster.
However, it was limited to `-o name -s name`, this work extends this
optimization to work with:
- name
- guid
- createtxg
- numclones
- inconsistent
- redacted
- origin
and could be further extended to any other properties supported by
dsl_dataset_fast_stat() or similar, that do not require extra locking
or reading from disk.
This was committed before (9a9e2e343dfa2af28bf7910de77ae73aa006de62),
but was reverted due to a regression when used with an older kernel.
If the kernel does not populate zc->zc_objset_stats, we now fallback
to getting the properties via the slower interface, to avoid problems
with newer userland and older kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#14110
We do a simple ifdef to avoid calling enable_kernel_spe()/
disable_kernel_spe() on PowerPC.
Reported-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14233Closes#14244
There was the series from me a year ago which fixed most of the
callback vs implementation prototype mismatches. It was based on
running the CFI-enabled kernel (in permissive mode -- warning
instead of panic) and performing a full ZTS cycle, and then fixing
all of the problems caught by CFI.
Now, Clang 16-dev has new warning flag, -Wcast-function-type-strict,
which detect such mismatches at compile-time. It allows to find the
remaining issues missed by the first series.
There are only two of them left: one for the
secpolicy_vnode_setattr() callback and one for taskq_dispatch().
The fix is easy, since they are not used anywhere else.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes#14207
I've noticed that some of those counters are used in hot paths like
dnode_hold_impl(), and results of this change is visible in profiler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#14198
It is protected by z_hold_locks, so we do not need more serialization,
simple integer math should be fine.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#14196
Linux defaults to setting "failfast" on BIOs, so that the OS will not
retry IOs that fail, and instead report the error to ZFS.
In some cases, such as errors reported by the HBA driver, not
the device itself, we would wish to retry rather than generating
vdev errors in ZFS. This new property allows that.
This introduces a per vdev option to disable the failfast option.
This also introduces a global module parameter to define the failfast
mask value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Submitted-by: Klara, Inc.
Closes#14056
If there were no zil entries to replay, skip zil_close. zil_close waits
for a transaction to sync. That can take several seconds, for example
during pool import of a resilvering pool. Skipping zil_close can cut
the time for "zpool import" from 2 hours to 45 seconds on a resilvering
pool with a thousand zvols.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Sponsored-by: Axcient
Closes#13999Closes#14015
Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping
infrastructure to support idmapped mounts of filesystems mounted
with an idmapping". Update the OpenZFS accordingly to improve the
idmapped mount support.
This pull request contains the following changes:
- xattr setter functions are fixed to take mnt_ns argument. Without
this, cp -p would fail for an idmapped mount in a user namespace.
- idmap_util is enhanced/fixed for its use in a user ns context.
- One test case added to test idmapped mount in a user ns.
Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14097
Most of this file was a pile of defines, apparently from Solaris that
controlled nothing in the source tree. A few things controlled the
definition of unused types or macros which I have removed.
Considerable further cleanup is possible including removal of
architectures FreeBSD never supported. This file should likely converge
with the Linux version to the extent possible.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
Only DKIOCFLUSHWRITECACHE is required.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
Require that ZFS_LEGACY_SUPPORT be defined for legacy ioctl support to
be built. For now, define it in zfs_ioctl_compat.h so support is always
built. This will allow systems that need never support pre-openzfs
tools a mechanism to remove support at build time. This code should
be removed once the need for tool compatability is gone.
No functional change at this time.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
vn_renamepath() is a Solaris-ism that was defined away in the FreeBSD
port. Now that the only use is in the FreeBSD zfs_vnops_os.c, drop it
entierly.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
This fixes -Wsingle-bit-bitfield-constant-conversion warning from
clang-16 like:
lib/libzfs/libzfs_dataset.c:4529:19: error: implicit truncation
from 'int' to a one-bit wide bit-field changes value from
1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
flags.nounmount = B_TRUE;
^ ~~~~~~
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14125
uu_avl and uu_list stored internal next/prev pointers and parent
pointers (unused) obfuscated (byte swapped) to hide them from a long
forgotten leak checker (No one at the 2022 OpenZFS developers meeting
could recall the history.) This would break on CHERI systems and adds
no obvious value. Rename the members, use proper types rather than
uintptr_t, and eliminate the related macros.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14126
Avoid assuming than a uint64_t can hold a pointer and reduce the
number of casts in the process.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14131
Avoid assuming that a pointer can fit in a uint64_t and use uintptr_t
instead.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14131
Nothing consumes this definition so stop defining it.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14128
Check __riscv_xlen == 64 rather than _LP64 and define _LP64 if missing.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14128
`snprintf()` is meant to protect against buffer overflows, but operating
on the buffer using its return value, possibly by calling it again, can
cause a buffer overflow, because it will return how many characters it
would have written if it had enough space even when it did not. In a
number of places, we repeatedly call snprintf() by successively
incrementing a buffer offset and decrementing a buffer length, by its
return value. This is a potentially unsafe usage of `snprintf()`
whenever the buffer length is reached. CodeQL complained about this.
To fix this, we introduce `kmem_scnprintf()`, which will return 0 when
the buffer is zero or the number of written characters, minus 1 to
exclude the NULL character, when the buffer was too small. In all other
cases, it behaves like snprintf(). The name is inspired by the Linux and
XNU kernels' `scnprintf()`. The implementation was written before I
thought to look at `scnprintf()` and had a good name for it, but it
turned out to have identical semantics to the Linux kernel version.
That lead to the name, `kmem_scnprintf()`.
CodeQL only catches this issue in loops, so repeated use of snprintf()
outside of a loop was not caught. As a result, a thorough audit of the
codebase was done to examine all instances of `snprintf()` usage for
potential problems and a few were caught. Fixes for them are included in
this patch.
Unfortunately, ZED is one of the places where `snprintf()` is
potentially used incorrectly. Since using `kmem_scnprintf()` in it would
require changing how it is linked, we modify its usage to make it safe,
no matter what buffer length is used. In addition, there was a bug in
the use of the return value where the NULL format character was not
being written by pwrite(). That has been fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14098
The previous version reported all the right info, but the VERIFY3 name
made a little more confusing when looking for the matching location in
the source code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob N ★ <robn@despairlabs.com>
Closes#14099
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.
In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12209Closes#14070
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).
In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12209Closes#14070
This allows for much cleaner VERIFY-level assertions.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#14070
Open files, which aren't present in the snapshot, which is being
roll-backed to, need to disappear from the visible VFS image of
the dataset.
Kernel provides d_drop function to drop invalid entry from
the dcache, but inode can be referenced by dentry multiple dentries.
The introduced zpl_d_drop_aliases function walks and invalidates
all aliases of an inode.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#9600Closes#14070
Follow up for 4938d01d which changed zio_flag from enum to uint64_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14100
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes#14086
vm_object_page_clean() expects that the associated vnode is locked
as VOP_PUTPAGES() may get called on the vnode.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#14079
The use of __noreturn__ in 55d7afa4ad on
spl_panic() caused objtool warnings on Linux when the kernel is built
with CONFIG_STACK_VALIDATION=y. This patch works around that by
restricting the application of __noreturn__ to builds for static
analyzers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14068
Microzap on-disk format does not include a hash tree, expecting one to
be built in RAM during mzap_open(). The built tree is linked to DMU
user buffer, freed when original DMU buffer is dropped from cache. I've
found that workloads accessing many large directories and having active
eviction from DMU cache spend significant amount of time building and
then destroying the trees. I've also found that for each 64 byte mzap
element additional 64 byte tree element is allocated, that is a waste
of memory and CPU caches.
Improve memory efficiency of the hash tree by switching from AVL-tree
to B-tree. It allows to save 24 bytes per element just on pointers.
Save 32 bits on mze_hash by storing only upper 32 bits since lower 32
bits are always zero for microzaps. Save 16 bits on mze_chunkid, since
microzap can never have so many elements. Respectively with the 16 bits
there can be no more than 16 bits of collision differentiators. As
result, struct mzap_ent now drops from 48 (rounded to 64) to 8 bytes.
Tune B-trees for small data. Reduce BTREE_CORE_ELEMS from 128 to 126
to allow struct zfs_btree_core in case of 8 byte elements to pack into
2KB instead of 4KB. Aside of the microzaps it should also help 32bit
range trees. Allow custom B-tree leaf size to reduce memmove() time.
Split zap_name_alloc() into zap_name_alloc() and zap_name_init_str().
It allows to not waste time allocating/freeing memory when processing
multiple names in a loop during mzap_open().
Together on a pool with 10K directories of 1800 files each and DMU
cache limited to 128MB this reduces time of `find . -name zzz` by 41%
from 7.63s to 4.47s, and saves additional ~30% of CPU time on the DMU
cache reclamation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14039