= Problem
While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Under some further
analysis we discovered that the pool would create gang blocks because
all its disks had non-zero write error counts and they'd be skipped
for normal metaslab allocations due to the following if-clause in
`metaslab_alloc_dva()`:
```
/*
* Avoid writing single-copy data to a failing,
* non-redundant vdev, unless we've already tried all
* other vdevs.
*/
if ((vd->vdev_stat.vs_write_errors > 0 ||
vd->vdev_state < VDEV_STATE_HEALTHY) &&
d == 0 && !try_hard && vd->vdev_children == 0) {
metaslab_trace_add(zal, mg, NULL, psize, d,
TRACE_VDEV_ERROR, allocator);
goto next;
}
```
= Proposed Solution
Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state so the past write
errors doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).
= Testing
I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME SIZE ALLOC FREE
volpool 22.5G 117M 22.4G
xvdb 7.99G 40.2M 7.46G
xvdc 7.99G 39.1M 7.46G
xvdd 7.99G 37.8M 7.46G
```
And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```
And got the vdevs to the following state:
```
$ zpool status volpool
pool: volpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:
NAME STATE READ WRITE CKSUM
volpool ONLINE 0 0 0
xvdb ONLINE 0 1 0
xvdc ONLINE 0 1 0
xvdd ONLINE 0 4 0
```
I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0 # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```
Then I checked that I the problem was reproduced in my VM as I the
gang count was growing in zdb as I was writting more data:
```
$ sudo zdb volpool | grep gang
ganged count: 1384
$ sudo zdb volpool | grep gang
ganged count: 1393
$ sudo zdb volpool | grep gang
ganged count: 1402
$ sudo zdb volpool | grep gang
ganged count: 1414
```
Then I updated my bits with this patch and the gang count stayed the
same.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#14003
If no errors are encountered, we read an uninitialized error value.
Clang's static analyzer complained about this.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14007
The metaslab_check_free() function only needs to be called in the
GANG|DEDUP|etc case because zio_free_sync() will internally call
metaslab_check_free().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Finix1979 <yancw@info2soft.com>
Closes#13977
If the `list_head()` returns NULL, we dereference it, right before we
check to see if it returned NULL.
We have defined two different pointers that both point to the same
thing, which are `origin_head` and `origin_ds`. Almost everything uses
`origin_ds`, so we switch them to use `origin_ds`.
We also promote `origin_ds` to a const pointer so that the compiler
verifies that nothing modifies it.
Coverity complained about this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13967
Some header files define structures like this one:
typedef const struct zio_checksum_info {
/* ... */
const char *ci_name;
} zio_abd_checksum_func_t;
So we can use `zio_abd_checksum_func_t` for const declarations now.
It's not needed that we use the `const` qualifier again like this:
`const zio_abd_checksum_func_t *varname;`
This patch solves the double const qualifiers, which were found by
smatch.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13961
= Issue
Recently we hit an assertion panic in `dsl_process_sub_livelist` while
exporting the spa and interrupting `bpobj_iterate_nofree`. In that case
`bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing
the intermediate AVL tree that keeps track of the livelist entries it
has encountered so far. At that point the code has a VERIFY for the
number of elements of the AVL expecting it to be zero (which is not the
case for EINTR).
= Fix
Cleanup any intermediate state before destroying the AVL when
encountering EINTR. Also added a comment documenting the scenario where
the EINTR comes up. There is no need to do anything else for the calles
of `dsl_process_sub_livelist` as they already handle the EINTR case.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#13939
2a493a4c71 was intended to fix all
instances of coverity reported unchecked return values, but
unfortunately, two were missed by mistake. This commit fixes the
unchecked return values that had been missed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13945
ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event. This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13797
The current value causes significant artificial slowdown during mass
parallel file removal, which can be observed both on FreeBSD and Linux
when running real workloads.
Sample results from Linux doing make -j 96 clean after an allyesconfig
modules build:
before: 4.14s user 6.79s system 48% cpu 22.631 total
after: 4.17s user 6.44s system 153% cpu 6.927 total
FreeBSD results in the ticket.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#13932Closes#13938
Coverity caught unsafe use of `strcpy()` in `ztest_dmu_objset_own()`,
`nfs_init_tmpfile()` and `dump_snapshot()`. It also caught an unsafe use
of `strlcat()` in `nfs_init_tmpfile()`.
Inspired by this, I did an audit of every single usage of `strcpy()` and
`strcat()` in the code. If I could not prove that the usage was safe, I
changed the code to use either `strlcpy()` or `strlcat()`, depending on
which function was originally used. In some cases, `snprintf()` was used
to replace multiple uses of `strcat` because it was cleaner.
Whenever I changed a function, I preferred to use `sizeof(dst)` when the
compiler is able to provide the string size via that. When it could not
because the string was passed by a caller, I checked the entire call
tree of the function to find out how big the buffer was and hard coded
it. Hardcoding is less than ideal, but it is safe unless someone shrinks
the buffer sizes being passed.
Additionally, Coverity reported three more string related issues:
* It caught a case where we do an overlapping memory copy in a call to
`snprintf()`. We fix that via `kmem_strdup()` and `kmem_strfree()`.
* It caught `sizeof (buf)` being used instead of `buflen` in
`zdb_nicenum()`'s call to `zfs_nicenum()`, which is passed to
`snprintf()`. We change that to pass `buflen`.
* It caught a theoretical unterminated string passed to `strcmp()`.
This one is likely a false positive, but we have the information
needed to do this more safely, so we change this to silence the false
positive not just in coverity, but potentially other static analysis
tools too. We switch to `strncmp()`.
* There was a false positive in tests/zfs-tests/cmd/dir_rd_update.c. We
suppress it by switching to `snprintf()` since other static analysis
tools might complain about it too. Interestingly, there is a possible
real bug there too, since it assumes that the passed directory path
ends with '/'. We add a '/' to fix that potential bug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13913
Coverity complains about a possible NULL pointer dereference. This is
impossible, but it suspects it because we do a NULL check against
`spa->spa_root_vdev`. This NULL check was never necessary and makes the
code harder to understand, so we drop it.
In particular, we dereference `spa->spa_root_vdev` when `new_state !=
POOL_STATE_UNINITIALIZED && !hardforce`. The first is only true when
spa_reset is called, which only occurs under fault injection. The
second is true unless `zpool export -F $POOLNAME` is used. Therefore,
we effectively *always* dereference the pointer. In the cases where we
do not, there is no reason to think it is unsafe. Therefore this change
is safe to make.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13905
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13875
Coverity found a bug in `zfs_secpolicy_create_clone()` where it is
possible for us to pass an unterminated string when `zfs_get_parent()`
returns an error. Upon inspection, it is clear that using `strlcpy()`
would have avoided this issue.
Looking at the codebase, there are a number of other uses of `strncpy()`
that are unsafe and even when it is used safely, switching to
`strlcpy()` would make the code more readable. Therefore, we switch all
instances where we use `strncpy()` to use `strlcpy()`.
Unfortunately, we do not portably have access to `strlcpy()` in
tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to
libspl. Modifying the appropriate Makefile.am to try to link to it
resulted in an error from the naming choice used in the file. Trying to
disable the check on the file did not work on FreeBSD because Clang
ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`.
We workaround that by explictly including the C file from libspl into
the test. This makes things build correctly everywhere.
We add a deprecation warning to `config/Rules.am` and suppress it on the
remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in
tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a
substitute.
This patch does not tackle the related problem of `strcpy()`, which is
even less safe. Thankfully, a quick inspection found that it is used far
more correctly than strncpy() was used. A quick inspection did not find
any problems with `strcpy()` usage outside of zhack, but it should be
said that I only checked around 90% of them.
Lastly, some of the fields in kstat_t varied in size by 1 depending on
whether they were in userspace or in the kernel. The origin of this
discrepancy appears to be 04a479f706 where
it was made for no apparent reason. It conflicts with the comment on
KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace
field sizes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13876
When receiving full/newfs on existing dataset, then it should be done
with "-F" flag. Its enforced for initial receive in checks done in
zfs_receive_one function of libzfs. Similarly, on resuming full/newfs
recv on existing dataset, it should be done with "-F" flag.
When dataset doesn't exist, then full/new recv is done on newly created
dataset and it's marked INCONSISTENT. But when receiving on existing
dataset, recv is first done on %recv and its marked INCONSISTENT.
Existing dataset is not marked INCONSISTENT. Resume of full/newfs
receive with dataset not INCONSISTENT indicates that its resuming newfs
on existing dataset. So, enforce "-F" flag in this case.
Also return an error from dmu_recv_resume_begin_check() in zfs kernel,
when its resuming full/newfs recv without force.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#13856Closes#13857
Recently, I have been making a push to fix things that coverity found.
However, I was curious what Clang's static analyzer reported, so I ran
it and found things that coverity had missed.
* contrib/pam_zfs_key/pam_zfs_key.c: If prop_mountpoint is passed more
than once, we leak memory.
* module/zfs/zcp_get.c: We leak memory on temporary properties in
userspace.
* tests/zfs-tests/cmd/draid.c: On error from vdev_draid_rand(), we leak
memory if best_map had been allocated by a prior iteration.
* tests/zfs-tests/cmd/mkfile.c: Memory used by the loop is not freed
before program termination.
Arguably, these are all minor issues, but if we ignore them, then they
could obscure serious bugs, so we fix them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13955
Coverity found a number of places where we either do MAX(unsigned, 0) or
do assertions that a unsigned variable is >= 0. These do nothing, so
let us drop them all.
It also found a spot where we do `if (unsigned >= 0 && ...)`. Let us
also drop the unsigned >= 0 check.
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13871
Coverity complained about unchecked return values and unused values that
turned out to be unused return values.
Different approaches were used to handle the different cases of
unchecked return values:
* cmd/zdb/zdb.c: VERIFY0 was used in one place since the existing code
had no error handling. An error message was printed in another to
match the rest of the code.
* cmd/zed/agents/zfs_retire.c: We dismiss the return value with `(void)`
because the value is expected to be potentially unset.
* cmd/zpool_influxdb/zpool_influxdb.c: We dismiss the return value with
`(void)` because the values are expected to be potentially unset.
* cmd/ztest.c: VERIFY0 was used since we want failures if something goes
wrong in ztest.
* module/zfs/dsl_dir.c: We dismiss the return value with `(void)`
because there is no guarantee that the zap entry will always be there.
For example, old pools imported readonly would not have it and we do
not want to fail here because of that.
* module/zfs/zfs_fm.c: `fnvlist_add_*()` was used since the
allocations sleep and thus can never fail.
* module/zfs/zvol.c: We dismiss the return value with `(void)` because
we do not need it. This matches what is already done in the analogous
`zfs_replay_write2()`.
* tests/zfs-tests/cmd/draid.c: We suppress one return value with
`(void)` since the code handles errors already. The other return value
is handled by switching to `fnvlist_lookup_uint8_array()`.
* tests/zfs-tests/cmd/file/file_fadvise.c: We add error handling.
* tests/zfs-tests/cmd/mmap_sync.c: We add error handling for munmap, but
ignore failures on remove() with (void) since it is expected to be
able to fail.
* tests/zfs-tests/cmd/mmapwrite.c: We add error handling.
As for unused return values, they were all in places where there was
error handling, so logic was added to handle the return values.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13920
Incorrectly sizing the array of hash locks used to protect the
dbuf hash table can lead to contention and reduce performance.
We could unconditionally allocate a larger array for the locks
but it's wasteful, particularly for a low-memory system.
Instead, dynamically allocate the array of locks and scale
it based on total system memory.
Additionally, add a new `dbuf_mutex_cache_shift` module option
which can be used to override the hash lock array size. This is
disabled by default (dbuf_mutex_hash_shift=0) and can only be
set at module load time. The minimum target array size is set
to 8192, this matches the current constant value.
Note that the count of the dbuf hash table and count of the
mutex array were added to the /proc/spl/kstat/zfs/dbufstats
kstat.
Finally, this change removes the _KERNEL conditional checks.
These were not required since for the user space build there
is no difference between the kmem and vmem interfaces.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13928
This reverts commit 34dbc618f5. While this
change resolved the lock contention observed for certain workloads, it
inadventantly reduced the maximum hash inserts/removes per second. This
appears to be due to the slightly higher acquisition cost of a rwlock vs
a mutex.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Coverity complains about this. It is not a bug as long as we never shift
by more than 31, but it is not terrible to change the constants from 1
to 1ULL as clean up.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13914
For encrypted raw receive, objset creation is delayed until a call to
dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be
populated when calling zpl_earlier_version(). To correctly handle the
ZFS_PROP_SHARESMB property for encrypted raw receive, this change
delays setting the property.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13878
Coverity complained about the format specifiers not matching variables.
In one case, the variable is a constant, so we fix it. In another, we
were missing an argument (about which coverity also complained).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13888
Apply similar options to BLAKE3 as it is done for zfs_fletcher_4_impl.
The zfs module parameter on Linux changes from icp_blake3_impl to
zfs_blake3_impl.
You can check and set it on Linux via sysfs like this:
```
[bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl
cycle [fastest] generic sse2 sse41 avx2
[bash]# echo sse2 > /sys/module/zfs/parameters/zfs_blake3_impl
[bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl
cycle fastest generic [sse2] sse41 avx2
```
The modprobe module parameters may also be used now:
```
[bash]# modprobe zfs zfs_blake3_impl=sse41
[bash]# cat /sys/module/zfs/parameters/zfs_blake3_impl
cycle fastest generic sse2 [sse41] avx2
```
On FreeBSD the BLAKE3 implementation can be set via sysctl like this:
```
[bsd]# sysctl vfs.zfs.blake3_impl
vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2
[bsd]# sysctl vfs.zfs.blake3_impl=sse2
vfs.zfs.blake3_impl: cycle [fastest] generic sse2 sse41 avx2 \
-> cycle fastest generic [sse2] sse41 avx2
```
This commit changes also some Blake3 internals like these:
- blake3_impl_ops_t was renamed to blake3_ops_t
- all functions are named blake3_impl_NAME() now
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13725
- Some optimizations for bqueue enqueue/dequeue.
- Added a fix to prevent deadlock when both bqueue_enqueue_impl()
and bqueue_dequeue() waits for signal to be triggered.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13855
Replace ZFS_ENTER and ZFS_VERIFY_ZP, which have hidden returns, with
functions that return error code. The reason we want to do this is
because hidden returns are not obvious and had caused some missing fail
path unwinding.
This patch changes the common, linux, and freebsd parts. Also fixes
fail path unwinding in zfs_fsync, zpl_fsync, zpl_xattr_{list,get,set}, and
zfs_lookup().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#13831
I see a few issues in the issue tracker that might be aided by being
able to turn this on. We have no module parameter for it, so I would
like to add one.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13874
We pass sizeof (struct redact_record *) rather than sizeof (struct
redact_record). Passing the pointer size is wrong.
Coverity caught this in two places.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13885
Coverity pointed out that if we somehow receive SPA_FEATURE_NONE, we
will use a negative number as an array index. A defensive assertion
seems appropriate.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13872
Coverity complains about a "use-after-free" bug in
`dbuf_prefetch_indirect_done()` because we use a pointer value after
freeing its buffer. The pointer is used for refcounting in ARC (as the
reference holder). There is a theoretical situation where the pointer
would be reused in a way that causes the refcounting to collide, so we
change the order in which we call arc_buf_destroy() and
dbuf_prefetch_fini() to match the rest of the function. This prevents
the theoretical situation from being a possibility.
Also, we have a few return statements with a value, despite this being a
void function. We clean those up while we are making changes here.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13869
We inherited membar_consumer() and membar_producer() from OpenSolaris,
but we had replaced membar_consumer() with Linux's smp_rmb() in
zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the
Linux-only smp_rmb().
We reinstate membar_consumer() in platform independent code and fix the
FreeBSD SPL to implement membar_consumer() in a way analogous to Linux.
Reviewed-by: Konstantin Belousov <kib@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13843
Unused code detected by coverity.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13868
Coverity reported this as an out-of-bounds read.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13865
Coverty static analysis found these.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#10989Closes#13861
In our codebase, `cond_resched() and `schedule()` are Linux kernel
functions that have replaced the OpenSolaris `kpreempt()` functions in
the codebase to such an extent that `kpreempt()` in zfs_context.h was
broken. Nobody noticed because we did not actually use it. The header
had defined `kpreempt()` as `yield()`, which works on OpenSolaris and
Illumos where `sched_yield()` is a wrapper for `yield()`, but that does
not work on any other platform.
The FreeBSD platform specific code implemented shims for these, but the
shim for `schedule()` forced us to wait, which is different than merely
rescheduling to another thread as the original Linux code does, while
the shim for `cond_resched()` had the same definition as its kernel
kpreempt() shim.
After studying this, I have concluded that we should reintroduce the
kpreempt() function in platform independent code with the following
definitions:
- In the Linux kernel:
kpreempt(unused) -> cond_resched()
- In the FreeBSD kernel:
kpreempt(unused) -> kern_yield(PRI_USER)
- In userspace:
kpreempt(unused) -> sched_yield()
In userspace, nothing changes from this cleanup. In the kernels, the
function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and
`cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on
FreeBSD and `schedule()` on Linux. This makes our behavior consistent
across platforms.
Note that Linux's SPL continues to use `cond_resched()` and
`schedule()`. However, those functions have been removed from both the
FreeBSD code and userspace code.
This should have the benefit of making it slightly easier to port the
code to new platforms by making how things should be mapped less
confusing.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13845
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.
When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency. Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.
Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.
Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".
There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified. Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#13798
This commit adds DD_FIELD string used in extensified dsl_dir zap object
for snapshots_changed property.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#13819
Only the single snapshot rename is provided.
The recursive or more complex rename can be scripted.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#13802
It makes sense to free memory in smaller chunks when approaching
arc_c_min to let other kernel subsystems to free more, since after
that point we can't free anything. This also matches behavior on
Linux, where to shrinker reported only the size above arc_c_min.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#13794
This reverts commit 80a650b7bb. This change
inadvertently introduced a regression in ztest where one of the new ASSERTs
is triggered in dsl_scan_visitbp().
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12275Closes#13799
Currently, snapshots_changed property is stored in dd_props_zapobj, due
to which the property is assumed to be local. This causes a difference
in behavior with respect to other readonly properties.
This commit stores the snapshots_changed property in dd_object. Source
is not set to local in this case, which makes it consistent with other
readonly properties.
This commit also updates the date string format to include seconds.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#13785
When scrubbing an encrypted filesystem with unloaded key still report an
error in zpool status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#13675Closes#13717
There are a couple changes included here. The first is to introduce
a cap on the size the ZED will grow the zevent list to. One million
entries is more than enough for most use cases, and if you are
overflowing that value, the problem needs to be addressed another
way. The value is also tunable, for those who want the limit to be
higher or lower.
The other change is to add a kernel module parameter that allows
snapshot creation/deletion to be exempted from the history logging;
for most workloads, having these things logged is valuable, but for
some workloads it produces large quantities of log spam and isn't
especially helpful.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Issue #13374Closes#13753
Thanks to George Wilson for clarifying this on Slack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes#13698
This is a small cleanup for a trivial problem which happened to
be noticed while another issue was being investigated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#13730
Make dd_snap_cmtime property persistent across mount and unmount
operations by storing in ZAP and restore the value from ZAP on hold
into dd_snap_cmtime instead of updating it.
Expose dd_snap_cmtime as 'snapshots_changed' property that provides a
mechanism to quickly determine whether snapshot list for dataset has
changed without having to mount a dataset or iterate the snapshot list.
It specifies the time at which a snapshot for a dataset was last
created or deleted. This allows us to be more efficient how often we
query snapshots.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#13635
The checksum benchmarking on module load may take a really long time
on embedded systems with a slow cpu. Avoid all benchmarks >= 1MiB on
systems, where EdonR is slower then 300 MiB/s.
This limit is currently hardcoded via the define LIMIT_PERF_MBS.
This is the new benchmark output of a slow Intel Atom:
```
implementation 1k 4k 16k 64k 256k 1m 4m 16m
edonr-generic 209 257 268 259 262 0 0 0
skein-generic 129 150 151 150 150 0 0 0
sha256-generic 50 55 56 56 56 0 0 0
sha512-generic 76 86 88 89 88 0 0 0
blake3-generic 63 62 62 62 61 0 0 0
blake3-sse2 114 292 301 307 309 0 0 0
```
Reviewed-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13695
This type of recv is used to heal corrupted data when a replica
of the data already exists (in the form of a send file for example).
With the provided send stream, corrective receive will read from
disk blocks described by the WRITE records. When any of the reads
come back with ECKSUM we use the data from the corresponding WRITE
record to rewrite the corrupted block.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes#9372
- When iterating snapshots with name only, e.g., "-o name -s name",
libzfs uses simple snapshot iterator and results are displayed
in alphabetic order. This PR adds support for faster version of
createtxg sort by avoiding nvlist parsing for properties. Flags
"-o name -s createtxg" will enable createtxg sort while using
simple snapshot iterator.
- Added support to read createtxg property directly from zfs handle
for filesystem, volume and snapshot types instead of parsing nvlist.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13577
ZIL kstats are reported in an inclusive way, i.e., same counters are
shared to capture all the activities happening in zil. Added support
to report zil stats for every datset individually by combining them
with already exposed dataset kstats.
Wmsum uses per cpu counters and provide less overhead as compared
to atomic operations. Updated zil kstats to replace wmsum counters
to avoid atomic operations.
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13636
It may happen that scan bookmark points to a block that was turned
into a part of a big hole. In such case dsl_scan_visitbp() may skip
it and dsl_scan_check_resume() will not be called for it. As result
new scan suspend won't be possible until the end of the object, that
may take hours if the object is a multi-terabyte ZVOL on a slow HDD
pool, stretching TXG to all that time, creating all sorts of problems.
This patch changes the resume condition to any greater or equal block,
so even if we miss the bookmarked block, the next one we find will
delete the bookmark, allowing new suspend.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13643
Allocation via kmem_cache_alloc() is limited to less then 4m for
some architectures.
This commit limits the benchmarks with the linear abd cache to 1m
on all architectures and adds 4m + 16m benchmarks via non-linear
abd_alloc().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13669Closes#13670
Fixes a small kernel memory leak which would occur if a pool failed
to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from
a presumably damaged MOS config. In the case of a missing key there
was no leak.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Finix1979 <yancw@info2soft.com>
Closes#13629
Before this change for every valid parity column raidz_parity_verify()
allocated new buffer and copied there existing data, then recalculated
the parity and compared the result with the copy. This patch removes
the memory copy, simply swapping original buffer pointers with newly
allocated empty ones for parity recalculation and comparison. Original
buffers with potentially incorrect parity data are then just freed,
while new recalculated ones are used for repair.
On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this
change reduces memory traffic during scrub by 17% and total unhalted
CPU time by 25%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13613
Issuing several scrub reads for a block we may use the parent ZIO
buffer for one of child ZIOs. If that read complete successfully,
then we won't need to copy the data explicitly. If block has only
one copy (typical for root vdev, which is also a mirror inside),
then we never need to copy -- succeed or fail as-is. Previous
code also copied data from buffer of every successfully completed
child ZIO, but that just does not make any sense.
On healthy N-wide mirror this saves all N+1 (or even more in case
of ditto blocks) memory copies for each scrubbed block, allowing
CPU to focus mostly on check-summing. For other vdev types it
should save one memory copy per block copy at root vdev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13606
If a dnode has a spill pointer, and we use DN_SLOTS_TO_BONUSLEN() then
we will possibly include the spill pointer in the len calculation and it
will be byteswapped. Then dnode_byteswap() will carry on and swap the
spill pointer again. Fix this by using DN_MAX_BONUS_LEN() instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#13002Closes#13015
Block statistics calculation during scrub I/O issue in case of sorted
scrub accounted ditto blocks several times. Embedded blocks on other
side were not accounted at all. This change moves the accounting from
issue to scan stage, that fixes both problems and also allows to avoid
pool-wide locking and the lock contention it created.
Since this statistics is quite specific and is not even exposed now
anywhere, disable its calculation by default to not waste CPU time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13579
Move the use of the db pointer after it is freed. It's only used as
a tag so a dereference would never occur, but there's no reason we
can't invert the order to resolve the warning.
module/zfs/dbuf.c: In function 'dbuf_destroy':
module/zfs/dbuf.c:2953:17: error:
pointer 'db' may be used after 'free' [-Werror=use-after-free]
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13528Closes#13575
Move the use of the private pointer after it is freed. It's only
used as a tag so a dereference would never occur, but there's no
harm in inverting the order to resolve the warning.
module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done':
module/zfs/dbuf.c:3204:17: error:
pointer 'private' may be used after 'free' [-Werror=use-after-free]
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13528Closes#13575
The memcpy(), memmove(), and memset() functions have been annotated
to perform bounds checking when using FORTIFY_SOURCE. A warning is
now generted when writing beyond the end of the specified field.
Alternately, the new struct_group() macro could be used to create
an anonymous union member for use by memcpy(). However, since this
is the only place the macro would be helpful it's preferable to
restructure the code slights to avoid the need for additional
compatibility code when the macro does not exist.
https://lore.kernel.org/lkml/20211118183807.1283332-1-keescook@chromium.org/T/
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13528Closes#13575
Restructure the code in zfs_log_xvattr() to use a lr_attr_end
structure when accessing lr_attr_t elements located after the
variable sized array. This makes the code more understandable
and resolves the accessing beyond the end of the field warnings.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13528Closes#13575
The current codebase does not support raw sending buffers with block
size > 128kB when large_blocks is not active. This can happen in the
codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which
calls back dmu_objset_write_done()->dsl_dataset_block_born(). If
dsl_dataset_sync() completes its run before dsl_dataset_block_born() is
called, we will end up not activating some of the necessary flags, while
having blocks based on those flags written in the filesystem. A
subsequent send will then panic.
Fix this by directly deciding in dmu_objset_sync() whether these flags
need to be activated later by dsl_dataset_sync(). Instead of panicking
due to a NULL pointer dereference in dmu_dump_write() in case of a send,
print out an error message. Also during scrub verify there are no
contradicting filesystem flags.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12275Closes#12438
Change math to make it like the ARC, using multiplications instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13591
- Introduce first element offset within a leaf. It allows to reduce
by ~50% average memmove() size when adding/removing elements. If the
added/removed element is in the first half of the leaf, we may shift
elements before it and adjust the bth_first instead of moving more
elements after it.
- Use memcpy() instead of memmove() when we know there is no overlap.
- Switch from uint64_t to uint32_t. It does not limit anything,
but 32-bit arches should appreciate it greatly in hot paths.
- Store leaf capacity in struct btree to avoid 64-bit divisions.
- Adjust zfs_btree_insert_into_leaf() to always result in balanced
leaves after splitting, no matter where the new element was inserted.
Not that we care about it much, but it should also allow B-trees with
as little as two elements per leaf instead of 4 previously.
When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this
reduces amount of time spent in memmove() inside the scan thread from
13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes.
It should also reduce spacemaps load time, but I haven't measured it.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13582
- Reduce size and comparison complexity of q_exts_by_size B-tree.
Previous code used two 64-bit divisions and many other operations to
compare two B-tree elements. It created enormous overhead. This
implementation moves the math to the upper level and stores the score
in the B-tree elements themselves. Since all that we need to store in
that B-tree is the extent score and offset, those can fit into single
8 byte value instead of 24 bytes of q_exts_by_addr element and can be
compared with single operation.
- Better decouple secondary tree logic from main range_tree by moving
rt_btree_ops and related functions into dsl_scan.c as ext_size_ops.
Those functions are very small to worry about the code duplication and
range_tree does not need to know details such as rt_btree_compare.
- Instead of accounting number of pending bytes per pool, that needs
atomic on global variable per block, account the number of non-empty
per-vdev queues, that change much more rarely.
- When extent scan is interrupted by TXG end, continue it in the next
TXG instead of selecting next best extent. It allows to avoid leaving
one truncated (and so likely not the best any more) extent each TXG.
On top of some other optimizations this saves about 1.5 minutes out of
10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13576
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read. This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children. Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.
However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.
While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid. This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance. This means the damaged
half of the mirror will go undetected even after scrubbing.
For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.
This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing. When the BP isn't available for
verification, then compare the data buffers from each child. They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children. Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13555
The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be
used in this code segment. Work around this by pre-allocating a percpu
context array for later use.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13568
During sorted scrub multiple threads (one per vdev) are issuing many
ZIOs same time, all using the same scn->scn_zio_root ZIO as parent.
It causes huge lock contention on the single global lock on that ZIO.
Improve it by introducing per-queue null ZIOs, children to that one,
and using them instead as proxy.
For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this
dramatically reduces lock contention and reduces scrub time from 21
minutes down to 12.5, while actual read stages (not scan) are about
3x faster, reaching 100K blocks per second per vdev.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13553
Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should
count 2x sizeof (range_seg_gap_t) per node. And since average B-tree
memory efficiency is about 75%, we should increase it to 3x.
Previous code under-counted up to 30% of the memory usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13537
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes#10058Closes#12918
It is typical, but not generally true that if log summary has more
blocks it must also have unflushed metaslabs. Normally with metaslabs
flushed in order it works, but there are known exceptions, such as
device removal or metaslab being loaded during its flush attempt.
Before 600a02b884 if spa_flush_metaslabs() hit loading metaslab it
usually stopped (unless memlimit is also exceeded), but now it may
flush more metaslabs, just skipping that particular one. This
increased chances of assertion to fire when the skipped metaslab is
flushed on next iteration if all other metaslabs in that summary
entry are already flushed out of order.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13486Closes#13513
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#13518
In current zil_commit() process, transaction lwb_tx is assigned in
zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done().
Thus, during lwb write out process, the txg is held in open or quiesing
state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio
latency is high, it will cause txg_sync_thread() to starve.
The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the
'syncing' txg state. That is, in zil_sync().
In this patch, it achieves the goal without holding transaction.
A new function zil_lwb_flush_wait_all() is introduced. It waits for
the completion of all the zil_lwb_flush_vdevs_done() by given txg.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes#12321
- Make prefetch distance adaptive: up to 4MB prefetch doubles for
every, hit same as before, but after that it grows by 1/8 every time
the prefetch read does not complete in time to satisfy the demand.
My tests show that 4MB is sufficient for wide NVMe pool to saturate
single reader thread at 2.5GB/s, while new 64MB maximum allows the
same thread to reach 1.5GB/s on wide HDD pool. Further distance
increase may increase speed even more, but less dramatic and with
higher latency.
- Allow early reuse of inactive prefetch streams: streams that never
saw hits can be reused immediately if there is a demand, while others
can be reused after 1s of inactivity, starting with the oldest. After
2s of inactivity streams are deleted to free resources same as before.
This allows by several times increase strided read performance on HDD
pool in presence of simultaneous random reads, previously filling the
zfetch_max_streams limit for seconds and so blocking most of prefetch.
- Always issue intermediate indirect block reads with SYNC priority.
Each of those reads if delayed for longer may delay up to 1024 other
block prefetches, that may be not good for wide pools.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13452
This issue was discovered by zloop runs. When a mirror or other
redundant top-level vdev has a disk failure, and the disk is replaced,
the rebuild process occurs. A removal can happen while this is in
progress. If the removal completes before the rebuild does, the
removal process will try to free the vdev that is still in use.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#13498
Original Log Size Limit implementation blocked all writes in case of
limit reached until the TXG is committed and the log is freed. It
caused huge delays and following speed spikes in application writes.
This implementation instead smoothly throttles writes, using exactly
the same mechanism as used for dirty data.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: jxdking <lostking2008@hotmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Issue #12284Closes#13476
It turns out that "do LZ4 and zstd-1 both fail" is a great heuristic
for "don't even bother trying higher zstd tiers".
By way of illustration:
$ cat /incompress | mbuffer | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_normal
summary: 39.8 GiByte in 3min 40.2sec - average of 185 MiB/s
$ echo 3 | sudo tee /sys/module/zzstd/parameters/zstd_lz4_pass
3
$ cat /incompress | mbuffer -m 4G | zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_patched
summary: 39.8 GiByte in 48.6sec - average of 839 MiB/s
$ sudo zfs list -p -o name,used,lused,ratio evenfaster/lowcomp_1M_zstd12_normal evenfaster/lowcomp_1M_zstd12_patched
NAME USED LUSED RATIO
evenfaster/lowcomp_1M_zstd12_normal 39549931520 42721221632 1.08
evenfaster/lowcomp_1M_zstd12_patched 39626399744 42721217536 1.07
$ python3 -c "print(39626399744 - 39549931520)"
76468224
$
I'll take 76 MB out of 42 GB for > 4x speedup.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13244
We want `zpool import` to be highly robust and never panic, even
when encountering corrupt metadata. This is already handled in the
arc_read() code path, which covers most cases, but spa_load_verify_cb()
relies on zio_read() and is responsible for verifying the block pointer.
During import it is also possible to encounter blocks pointers which
contain ZIO_COMPRESS_INHERIT and ZIO_CHECKSUM_INHERIT values. Relax
the verification function slightly to allow this.
Futhermore, extend dsl_scan_recurse() to verify the block pointer
contents of level zero blocks which are not of type DMU_OT_DNODE or
DMU_OT_OBJSET. This is handled by arc_read() in the other cases.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13124Closes#13360
There are times when end-users may wish to have
a fast and convenient method to get zpool guid
without having to use libzfs. This commit
exposes the zpool guid via kstats in similar
manner to the zpool state.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes#13466
Recognise initial whitespace, + in both cases,
and - also in unsigneds
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13434
clang-15 emits the following error message for functions without
a prototype:
fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
a function declaration without a prototype is deprecated
in all versions of C [-Werror,-Wstrict-prototypes]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes#13421
I genuinely don't know why this didn't come up before,
but adding the LZ4 early abort pointed out this flaw,
in which we're allocating a buffer of one size, and
then telling the compressor that we're handing it buffers
of a different size, which may be Very Different - say,
allocating 512b and then telling it the inputs are 128k.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13375
When calculating mg_aliquot alike to #12046 use number of unique data
disks in the vdev, not the total number of children vdev. Increase
default value of the tunable from 512KB to 1MB to compensate.
Before this change each disk in striped pool was getting 512KB of
sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB.
After this change in all the cases each disk should get 1MB.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13388
Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may be taking a hold of the same dbuf.
This should also be observable on FreeBSD as long as there are enough
processes accessing the volume concurrently.
This is further aggregrated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.
This has been observed to result in a somehwat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) using volumes.
This commit improves the siutation by switching the hash mutex to
an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS
from 2048 to 8192 to further reduce the odds of a hash collision.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13405
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.
This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts
and even completes. But when it's about to check if the PG_writeback
bit has been cleared, another writeback with WB_SYNC_NONE starts.
The sync page writeback ends up waiting for the non-sync page
writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a
WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
waiting for the WB_SYNC_NONE writeback.
The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes#12662Closes#12790
Commit 361a7e8 (log xattr=sa create/remove/update to ZIL) introduced a
TX_SETSAXATTR, but missed to add a corresponding entry in
zvol_replay_vector. Adding a missing replay entry in zvol_replay_vector.
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#13396Closes#13395
Increase the default allowed maximum recordsize from 1M to 16M.
As described in the zfs(4) man page, there are significant costs
which need to be considered before using very large blocks.
However, there are scenarios where they make good sense and
it should no longer be necessary to artificially restrict their
use behind a module option.
Note that for 32-bit platforms we continue to leave this
restriction in place due to the limited virtual address space
available (256-512MB). On these systems only a handful
of blocks could be cached at any one time severely impacting
performance and potentially stability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12830Closes#13302
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool. As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits. We've observed one of such systems importing pool for about
45 minutes.
This patch improves the situation from five sides:
- By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
- By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
- By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
- By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
- By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.
As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12789
Currently, determining which datasets are affected by corruption is
a manual process.
The primary difficulty in reporting the list of affected snapshots is
that since the error was initially found, the snapshot where the error
originally occurred in, may have been deleted. To solve this issue, we
add the ID of the head dataset of the original snapshot which the error
was detected in, to the stored error report. Then any time a filesystem
is deleted, the errors associated with it are deleted as well. Any time
a clone promote occurs, we modify reports associated with the original
head to refer to the new head. The stored error reports are identified
by this head ID, the birth time of the block which the error occurred
in, as well as some information about the error itself are also stored.
Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then traverse
through the snapshots of the clone family, terminating a branch if the
block was replaced in a given snapshot. Then we report this information
back to libzfs, and to the zpool status command, where it is displayed
as follows:
pool: test
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3
08:27:57 2021
config:
NAME STATE READ WRITE CKSUM
test ONLINE 0 0 0
sdb ONLINE 0 0 1.58K
errors: Permanent errors have been detected in the following files:
test@1:/test.0.0
/test/test.0.0
/test/1clone/test.0.0
A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of #9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, fixing the error returns,
and updating the old on-disk error logs to the new format when
activating the feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9175Closes#12812
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13274
These are displayed as the descriptions of the sysctl's on FreeBSD
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#13334
An earlier commit introduces AT_MODE into the shared kernel sources,
instead of the preferred existing ATTR_MODE use.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Jorgen lundman <lundman@lundman.net>
Closes#13293
Parts of the Linux kernel build system struggle with _Noreturn. This
results in the following warnings when building on RHEL 8.5, and likely
other environments. Switch to using the __attribute__((noreturn)).
warning: objtool: dbuf_free_range()+0x2b8:
return with modified stack frame
warning: objtool: dbuf_free_range()+0x0:
stack state mismatch: cfa1=7+40 cfa2=7+8
...
WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation
failed, symbol will not be versioned.
WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation
failed, symbol will not be versioned.
...
Additionally, __thread_exit() has been renamed spl_thread_exit() and
made a static inline function. This was needed because the kernel
will generate a warning for symbols which are __attribute__((noreturn))
and then exported with EXPORT_SYMBOL.
While we could continue to use _Noreturn in user space I've also
switched it to __attribute__((noreturn)) purely for consistency
throughout the code base.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13238
This explodes as -Wunused-variable on GCC 8.5.0, despite it being used,
just not in an evaluated context
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13195
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12996
Caught by UBSAN: ZI_NO_DVA is passed explicitly in
zio_handle_decrypt_injection() and can be an ENOENT from zio_match_dva()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13146Closes#13190
Add physical device size/capacity only for physical devices in
'zpool list -v' instead of displaying "-" in the SIZE column.
This would make it easier to see the individual device capacity and
to determine which spares are large enough to replace which devices.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#12561Closes#13106
When unlinking multiple files from a pool at 100% capacity, it was
possible for ENOSPC to be returned after the first unlink. e.g.
rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0
rm: cannot remove '/mnt/fs/test1.1.0': No space left on device
rm: cannot remove '/mnt/fs/test1.2.0': No space left on device
After waiting for the pending deferred frees from the first unlink to
be processed the remaining files can then be unlinked. This is caused
by the quota limit in dsl_dir_tempreserve_impl() being temporarily
decreased to the allocatable pool capacity less any deferred free
space.
This is resolved using the existing mechanism of returning ERESTART
when over quota as long as we know enough space will shortly be
available after processing the pending deferred frees.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13172
A function that returns with no value is a different thing from a
function that doesn't return at all. Those are two orthogonal
concepts, commonly confused.
pthread_create(3) expects a pointer to a start routine that has a
very precise prototype:
void *(*start_routine)(void *);
However, other thread functions, such as kernel ones, expect:
void (*start_routine)(void *);
Providing a different one is incorrect, and has only been working
because the ABIs happen to produce a compatible function.
We should use '_Noreturn void', since it's the natural type, and
then provide a '_Noreturn void *' wrapper for pthread functions.
For consistency, replace most cases of __NORETURN or
__attribute__((noreturn)) by _Noreturn. _Noreturn is understood
by -std=gnu89, so it should be safe to use everywhere.
Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136
Ref: https://software.codidact.com/posts/285972
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Closes#13120
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.
For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.
This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
zil_process_commit_list()
if lwb == NULL // first zil_commit since zil_open
zil_create()
if no log block pointer in ZIL header:
if feature enabled and not active:
// CASE 1
enable, COALESCE txg wait with dmu_tx that allocated the
log block
else // log block was allocated earlier than this zil_open
if feature enabled and not active:
// CASE 2
enable, EXPLICIT txg wait
else // already have an in-DRAM LWB
if feature enabled and not active:
// this happens when we enable the feature after zil_create
// CASE 3
enable, EXPLICIT txg wait
[1] da6c28aaf6
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#8768Closes#9078
New `zfs_type_t` value `ZFS_TYPE_INVALID` is introduced.
Variable initialization is now possible to make GCC happy.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#12167Closes#13103
Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with
ashift=12 results to failure when mounting pool2/encrypted (Input/Output
error). Notably, the opposite, raw sending from a greater ashift to a
lower one does not fail.
This happens because zio_compress_write() falsely checks only
ZIO_FLAG_RAW_COMPRESS and not ZIO_FLAG_RAW_ENCRYPT which is also set in
encrypted raw send streams. In this case it rounds up the psize and if
not equal to the zio->io_size it modifies the block by zeroing out
the extra bytes. Because this happens in a SA attr. registration object
(type=46), the decryption fails upon mounting the filesystem, and zpool
status falsely reports an error.
Fix this by checking both ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT
before deciding whether to zero-pad a block.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#13067Closes#13074
It's the only one actually used
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12901
Add hooks for when spa is created, exported, activated and
deactivated. Used by macOS to attach iokit, and lock
kext as busy (to stop unloads).
Userland, Linux, and, FreeBSD have empty stubs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12801
There are two codepaths than can dirty final TXGs:
1) If calling spa_export_common()->spa_unload()->
spa_unload_log_sm_flush_all() after the spa_final_txg is set, then
spa_sync()->spa_flush_metaslabs() may end up dirtying the final
TXGs. Then we have the following panic:
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x62
spl_panic+0xea/0x102 [spl]
dbuf_dirty+0xcd6/0x11b0 [zfs]
zap_lockdir_impl+0x321/0x590 [zfs]
zap_lockdir+0xed/0x150 [zfs]
zap_update+0x69/0x250 [zfs]
feature_sync+0x5f/0x190 [zfs]
space_map_alloc+0x83/0xc0 [zfs]
spa_generate_syncing_log_sm+0x10b/0x2f0 [zfs]
spa_flush_metaslabs+0xb2/0x350 [zfs]
spa_sync_iterate_to_convergence+0x15a/0x320 [zfs]
spa_sync+0x2e0/0x840 [zfs]
txg_sync_thread+0x2b1/0x3f0 [zfs]
thread_generic_wrapper+0x62/0xa0 [spl]
kthread+0x127/0x150
ret_from_fork+0x22/0x30
</TASK>
2) Calling vdev_*_stop_all() for a second time in spa_unload() after
spa_export_common() unnecessarily delays the final TXGs beyond what
spa_final_txg is set at.
Fix this by performing the check and call for
spa_unload_log_sm_flush_all() before the spa_final_txg is set in
spa_export_common(). Also check if the spa_final_txg has already been
set in spa_unload() and skip those calls in this case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
External-issue: https://www.illumos.org/issues/9081Closes#13048Closes#13098
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#13097
On newer compilers, dsl_dataset.c now warns (or, on DEBUG, errors)
on uninitialized variable usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13083
dmu_recv_begin_check() unconditionally sets the DS_HOLD_FLAG_DECRYPT
flag before calling dsl_dataset_hold_flags(). If the key on the
receiving side isn't loaded or the send stream contains embedded
blocks, the receive check fails for a stream which is perfectly
valid and could be received without any problem. This seems like
a remnant of the initial design, where unencrypted datasets below
encrypted ones weren't allowed.
Add a condition to set `DS_HOLD_FLAG_DECRYPT` only for encrypted
datasets, modify an existing test to detect this regression and add
a test for raw replication streams.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#13033Closes#13076
All of these externs are already #included as static inline
functions via corresponding headers.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#13073
There's no need to make the platform ops dynamic dispatch.
This change replaces the dynamic dispatch with static calls to the
platform-specific functions.
To avoid name collisions, prefix all platform-specific functions
with `zvol_os_`.
I actually find `zvol_..._os` slightly nicer to read in the calling
code, but having it as a prefix is useful.
Advantage:
- easier jump-to-definition / grepping
- potential benefits to static analysis
- better legibility
Future work: also prefix remaining `static` functions in zvol_os.c.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes#12965
Use error thresholds from policy to control whether to scrub data
and/or metadata. If threshold is set to UINT64_MAX, then caller
probably does not care about result and we may skip that part.
By default import neither set the data error threshold nor read
the error counter, so skip the data scrub for faster import.
Metadata are still scrubbed and fail if even single error found.
While there just for symmetry return number of metadata errors in
case threshold is not set to zero and we haven't reached it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#13022
The following commit moved the users of `deferred` into function
dsl_pool_unreserved_space:
commit d2734cce68
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Date: Fri Dec 16 14:11:29 2016 -0800
OpenZFS 9166 - zfs storage pool checkpoint
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes#13056
POSIX requires that set-uid and set-gid bits to be removed when an
unprivileged user writes to a file and ZFS does that during normal
operation.
The problem arrises when the write is stored in the ZIL and replayed.
During replay we have no access to original credentials of the process
doing the write, so zfs_write() will be performed with the root
credentials. When root is doing the write set-uid and set-gid bits
are not removed from the file.
To correct that, log a separate TX_SETATTR entry that removed those bits
on first write to such file.
Idea from: Christian Schwarz
Add test for ZIL replay of setuid/setgid clearing.
Improve various edge cases when clearing setid bits:
- The setid bits can be readded during a single write, so make sure to check
for them on every chunk write.
- Log TX_SETATTR record at most once per transaction group (if the setid bits
are keep coming back).
- Move zfs_log_setattr() outside of zp->z_acl_lock.
Reviewed-by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#13027
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches
which results in passing `-fsanitize=address`
and `-fsanitize=undefined`, respectively, to the compiler. Those
flags are enabled in GitHub workflows for ZTS and zloop. Errors
reported by both instrumentations are corrected, except for:
- Memory leak reporting is (temporarily) suppressed. The cost of
fixing them is relatively high compared to the gains.
- Checksum computing functions in `module/zcommon/zfs_fletcher*`
have UBSan errors suppressed. It is completely impractical
to enforce 64-byte payload alignment there due to performance
impact.
- There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom
memory allocator is used there rendering that measure
unfeasible.
- Memory leaks detection has to be suppressed for `cmd/zvol_id`.
`zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is
incompatible with memory leaks detection.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#12928
In files created/modified before 4254acb there may be a corruption of
xattrs which is not reported during scrub and normal send/receive. It
manifests only as an error when raw sending/receiving. This happens
because currently only the raw receive path checks for discrepancies
between the dnode bonus length and the spill pointer flag.
In case we encounter a dnode whose bonus length is greater than the
predicted one, we should report an error. Modify in this regard
dnode_sync() with an assertion at the end, dump_dnode() to error out,
dsl_scan_recurse() to report errors during a scrub, and zstream to
report a warning when dumping. Also added a test to verify spill blocks
are sent correctly in a raw send.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12720Closes#13014
69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1)
had a useful policy regarding
CALL(ARG1,
ARG2,
ARG3);
above 2 lines. As it stands, it spits out *both*
sysctl_os.c: 385: continuation line should be indented by 4 spaces
sysctl_os.c: 385: indent by spaces instead of tabs
which is very cool
Another >10 could be fixed by removing "ulong" &al. handling.
I don't foresee anyone actually using it intentionally
(does it even exist in modern headers? why did it in the first place?).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12993
FreeBSD's implementation of zfs_uio_fault_move() returns EFAULT when a
page fault occurs while copying data in or out of user buffers. The VFS
treats such errors specially and will retry the I/O operation (which may
have made some partial progress).
When the FreeBSD and Linux implementations of zfs_write() were merged,
the handling of errors from dmu_write_uio_dbuf() changed such that
EFAULT is not handled as a partial write. For example, when appending
to a file, the z_size field of the znode is not updated after a partial
write resulting in EFAULT.
Restore the old handling of errors from dmu_write_uio_dbuf() to fix
this. This should have no impact on Linux, which has special handling
for EFAULT already.
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#12964
Raw receiving a snapshot back to the originating dataset is currently
impossible because of user accounting being present in the originating
dataset.
One solution would be resetting user accounting when raw receiving on
the receiving dataset. However, to recalculate it we would have to dirty
all dnodes, which may not be preferable on big datasets.
Instead, we rely on the os_phys flag
OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is
incomplete when raw receiving. Thus, on the next mount of the receiving
dataset the local mac protecting user accounting is zeroed out.
The flag is then cleared when user accounting of the raw received
snapshot is calculated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12981Closes#10523Closes#11221Closes#11294Closes#12594
Issue #11300
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#12985
Evaluated every variable that lives in .data (and globals in .rodata)
in the kernel modules, and constified/eliminated/localised them
appropriately. This means that all read-only data is now actually
read-only data, and, if possible, at file scope. A lot of previously-
global-symbols became inlinable (and inlined!) constants. Probably
not in a big Wowee Performance Moment, but hey.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12899
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Maybee <mark.maybee@delphix.com>
Closes#12951
There should be no risk of us accidentally hitting this since
we'd need maliciously malformed data to wind up in the pipeline,
or a very unfortunate random bit flip at exactly the right moment.
Still since we can handle it we should.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12947
As an experiment, I stole the lz4 decompressor from
upstream lz4 (1.9.3), and landed it.
Feedback suggested that keeping the vendor lz4 code isolated and
unlinted was probably reasonable, so I lobbed it into its own file.
It also seemed reasonable to put the mostly-untouched* code into
lz4.c proper, and relegate the integrated and ZFS-specific code to
lz4_zfs.c.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12805
Probably introduced inadvertently in b525630 (Native Encryption).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes#12935
This reverts commit f6a0dac84a.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12938
Verify that all empty sectors are zero filled before using them to
calculate parity. Failure to do so can result in incorrect parity
columns being generated and written to disk if the contents of an
empty sector are non-zero. This was possible because the checksum
only protects the data portions of the buffer, not the empty sector
padding.
This issue has been addressed by updating raidz_parity_verify() to
check that all dRAID empty sectors are zero filled. Any sectors
which are non-zero will be fixed, repair IO issued, and a checksum
error logged. They can then be safely used to verify the parity.
This specific type of damage is unlikely to occur since it requires
a disk to have silently returned bad data, for an empty sector, while
performing a scrub. However, if a pool were to have been damaged
in this way, scrubbing the pool with this change applied will repair
both the empty sector and parity columns as long as the data checksum
is valid. Checksum errors will be reported in the `zpool status`
output for any repairs which are made.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12857
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock. Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.
Reduce number of threads to 1, adding tunable in case somebody wants
to try more.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#12896
Issue #9966
If the fields to be listed and sorted by are constrained
to those populated by dsl_dataset_fast_stat(), then
zfs list is much faster, as it does not need to open each
objset and reads its properties.
A previous optimization by Pawel Dawidek
(0cee24064a) took advantage
of this to make listing snapshot names sorted only by name
much faster.
However, it was limited to `-o name -s name`, this work
extends this optimization to work with:
- name
- guid
- createtxg
- numclones
- inconsistent
- redacted
- origin
and could be further extended to any other properties
supported by dsl_dataset_fast_stat() or similar, that do
not require extra locking or reading from disk.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#11080
Improve the ability of zfs send to determine if a block is compressed
or not by using information contained in the blkptr.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12770
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12728
For my sins, I started running valgrind over ztest to try and fix
that pesky intermittent "zloop dies with malloc errors" problem.
This one seemed exciting enough to merit cutting a PR for before
the rest get polished.
Suggested-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12214
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.
For Linux 5.12 an older kernel this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reaquire it, then call the open callback
again. However, as of the 5.13 kernel this behavior was removed.
Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open(). This should always succeed except in the case where
a pool's vdev are layed on zvols, in which case it may fail. To
handle this case vdev_disk_open() has been updated to retry when
opening a device when -ERESTARTSYS is returned.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12301Closes#12759
Strict hole reporting was previously disabled by default as a
performance optimization. However, this has lead to confusion
over the expected behavior and a variety of workarounds being
adopted by consumers of ZFS. Change the default behavior to
always report holes and force the TXG sync.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12746
- Allocate ve_search on the stack, so we avoid allocating memory for
every I/O even if the VDEV cache is disabled.
- Reduce lock scope.
- Avoid locking in vdev_cache_read() when the VDEV cache is disabled.
- Sort file names properly.
- Correct comment.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#12749
Add properties, similar to pool properties, to each vdev.
This makes use of the existing per-vdev ZAP that was added as
part of device evacuation/removal.
A large number of read-only properties are exposed,
many of the members of struct vdev_t, that provide useful
statistics.
Adds support for read-only "removing" vdev property.
Adds the "allocating" property that defaults to "on" and
can be set to "off" to prevent future allocations from that
top-level vdev.
Supports user-defined vdev properties.
Includes support for properties.vdev in SYSFS.
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#11711
The code is integrated, builds fine, runs fine, there's not really
any reason not to.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12735
Special allocation class or dedup vdevs may have roughly the same
performance as L2ARC vdevs. Introduce a new tunable to exclude those
buffers from being cacheable on L2ARC.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#11761Closes#12285
The l2cache device could be added twice because vdev_inuse() does not
check spa_l2cache for added devices. Make l2cache vdevs inuse checking
logic more closer to spare vdevs.
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#9153Closes#12689
In addition to flushing memory mapped regions when checking holes,
commit de198f2d95 modified the dirty dnode detection logic to check
the dn->dn_dirty_records instead of the dn->dn_dirty_link. Relying
on the dirty record has not be reliable, switch back to the previous
method.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900Closes#12745
The only zdb utility require to read metaslab-related data during
read-only pool import because of spacemaps validation. Add global
variable which will allow zdb read spacemaps in case of readonly
import mode.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#9095Closes#12687
In order to reduce contention on the vq_lock, optional skip sectors
for Raidz writes can be placed into a single IO request. This is done by
padding out the linear ABD for a parity column to contain the skip
sector and by creating gang ABD to contain the data and skip sector for
data columns.
The vdev_raidz_map_alloc() function now contains specific functions for
both reads and write to allocate the ABD's that will be issued down to
the VDEV chldren.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-By: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#12333
When using lseek(2) to report data/holes memory mapped regions of
the file were ignored. This could result in incorrect results.
To handle this zfs_holey_common() was updated to asynchronously
writeback any dirty mmap(2) regions prior to reporting holes.
Additionally, while not strictly required, the dn_struct_rwlock is
now held over the dirty check to prevent the dnode structure from
changing. This ensures that a clean dnode can't be dirtied before
the data/hole is located. The range lock is now also taken to
ensure the call cannot race with zfs_write().
Furthermore, the code was refactored to provide a dnode_is_dirty()
helper function which checks the dnode for any dirty records to
determine its dirtiness.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900Closes#12724
It turns out that short-circuiting the EFAULT behavior on a short read
breaks things on FreeBSD. So until there's a nicer solution, let's
just revert the behavior for not-Linux.
Reference:
https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12698
If you've got multiple scrubs/resilvers going, it's rather helpful
to know which pool each scan line refers to.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes: #12674
The fnvlist versions of the functions are fatal if they fail,
saving each call from having to include checking the result.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Remove code duplication by moving code responsible for partial block
zeroing to a separate function: dnode_partial_zero().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#12627
Make the main dmu_buf_hold_array() function non-static.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#12628
Lustre makes light use of the zfs_refcount interfaces which
isn't a problem when using a non-debug build of OpenZFS. However,
when debugging is enabled the required symbols are not exported.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12613
refcount_add_many(foo,N) is not the same as
for (i=0; i < N; i++) { refcount_add(foo); }
Unfortunately, this is only actually true with debug kernels and
reference_tracking_enable=1.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12589Closes#12602
When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the
enclosure sysfs path to the fault LEDs, like:
vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8
However, this enclosure path doesn't get updated on successive imports
even if enclosure path to the disk changes. This patch fixes the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#11950Closes#12095
Commit 0c03d21ac9 left in a redundant if condition while
removing some code. Just remove it.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#12598
Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one
loop, get EFAULT back from zfs_uiomove() (because the iovec only holds
64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN
gets interpreted as "I didn't read anything", the caller tries again
without consuming the 64k we already read, and we're stuck.
This apparently works on newer kernels because the caller which breaks
on older Linux kernels by happily passing along a 1M read request and a
64k iovec just requests 64k at a time.
With this, we now won't return EFAULT if we got a partial read.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12370Closes#12509Closes#12516
In case an ARC buffer is allocated only on L2ARC, and there are
underlying errors in a pool with the cache device in faulty state, a
panic can occur in arc_read_done()->arc_hdr_destroy()->
arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free
the ARC buffer.
Fix this by discarding the buffer's identity in arc_hdr_destroy(), in
case the buffer is not empty, before calling arc_hdr_l2hdr_destroy().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12392
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths. Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default. To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.
Additional reading: https://lwn.net/Articles/794944/
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12441
Userland figures out which encryption-root keys are required to load,
and issues ZFS_IOC_LOAD_KEY.
The tail section of spa_keystore_load_wkey() will call
zvol_create_minors() on the encryption-root object.
Any clones of the encrypted zvol will not be plumbed. This commits
adds additional logic to detect if zvol has clones, and is encrypted,
then adds these to the list of zvols to call zvol_create_minors() on.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12471
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.
Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes#12391Closes#12443
The block pointer verification check in arc_read() should also
cover embedded block pointers. While highly unlikely, accessing
a damaged block pointer can result in panic. To further harden
the code extend the existing check to include embedded block
pointers and add a comment explaining the rational for this
sanity check. Lastly, correct a flaw in zfs_blkptr_verify()
so the error count is checked even when checking a untrusted
config to verify the non-pool-specific portions of a block
pointer.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12535
For kernel to send snapshot mount/unmount events to zed.
For kernel to send symlink creates/removes on zvol plumbing.
(/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX)
If zed misses the ENODEV, all errors after are EINVAL. Treat any error
as kernel module failure.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12416
When zfs_send_corrupt_data is set, use the TRAVERSE_HARD flag,
so traverse_visitbp() will not fail with ECKSUM if a blockpointer
cannot be read, but rather will continue and send the objects it can.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Sponsored-By: Klara Inc.
Sponsored-By: WHC Online Solutions Inc.
Closes#12541
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure. This assertion was
added in error, so we're removing it.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#9897Closes#12020Closes#12246
We round up the psize to the nearest multiple of the asize or to the
lsize, whichever is smaller. Once that's done, we allocate a new
buffer of the appropriate size, zero the tail, and copy the data
into it. This adds a small performance cost to these kinds of writes,
but fixes the bookkeeping problems.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#12522Closes#8462
Previously, zpool-iostat did not display any data regarding rebuild I/Os
in either the latency/size histograms (-w/-l/-r) or the queue data (-q).
This fix essentially utilizes the existing infrastructure for tracking
rebuild queue data and displays this data in the proper places within
zpool-iostat's output.
Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org>
Signed-off-by: Trevor Bautista <tbautista@lanl.gov>
Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
benchmark_raidz() allocates a row to benchmark parity calculation and
reconstruction. In the latter case, the parity blocks are left
uninitialized, leading to reports from KMSAN.
Initialize parity blocks to 0xAA as we do for the data earlier in the
function. This does not affect the selected RAID-Z implementation on
any of several systems tested.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#12473
When a header is allocated for full overwrite it is a waste of time
to allocate b_pabd/b_rabd for it, since arc_write() will free them
without ever being touched. If it is a read or a partial overwrite
then arc_read() and arc_hdr_decrypt() allocate them explicitly.
Reduced memory allocation in user threads also reduces ARC eviction
throttling there, proportionally increasing it in ZIO threads, that
is not good. To minimize or even avoid it introduce ARC allocation
reserve, allowing certain arc_get_data_abd() callers to allocate a
bit longer in situations where user threads will already throttle.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12398
It is very expensive and not informative to call multilist_is_empty()
for each arc_change_state() on debug builds to check for impossible.
Instead implement special index function for arc_l2c_only->arcs_list,
multilists, panicking on any attempt to use it.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12421
Instead of clearing stats inside arc_buf_alloc_impl() do it inside
arc_hdr_alloc() and arc_release(). It fixes statistics being wiped
every time a new dbuf is filled from the ARC.
Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits.
Since the hits are accounted under hash lock, replace atomics with
simple increments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12422
vq_lock is already too congested for two more operations per I/O.
Instead of dropping and reacquiring it inside vdev_queue_aggregate()
delegate the zio_vdev_io_bypass() and zio_execute() calls for parent
I/Os to callers, that drop the lock any way to execute the new I/O.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12297
Use atomic_load_64() for zfs_refcount_count() to prevent torn reads
on 32-bit platforms. On 64-bit ones it should not change anything.
When built with ZFS_DEBUG but running without tracking enabled use
atomics instead of mutexes same as for builds without ZFS_DEBUG.
Since rc_tracked can't change live we can check it without lock.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12420
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED
Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.
Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#12161
Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.
On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#12397
It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
decide whether to reallocate/copy the buffer, because the answer is
OS-specific and depends on the buffer size. Instead of that use
abd_size_alloc_linear(), moved into public header.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#12425
Commit 5dbf6c5a66 did not address these format specifier warnings
since they were introduced by an unrelated commit which had not
been rebased on 5dbf6c5a66 when merged. Fix them.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12435
In l2arc_add_vdev() first decide whether the device is eligible for
L2ARC rebuild or whole device trim and then add it to the list of cache
devices. Otherwise l2arc_feed_thread() might already start writing on
the device invalidating previous content as l2ad_hand = l2ad_start.
However l2arc_rebuild_vdev() needs the device present in the cache
device list to figure out its l2arc_dev_t. Fix this by moving most of
l2arc_rebuild_vdev() in a new function l2arc_rebuild_dev() which does
not need to search in the cache device list.
In contrast to l2arc_add_vdev() we do not have to worry about
l2arc_feed_thread() invalidating previous content when onlining a
cache device. The device parameters (l2ad*) are not cleared when
offlining the device and writing new buffers will not invalidate
all previous content. In worst case only buffers that have not had
their log block written to the device will be lost.
Retire persist_l2arc_00{4,5,8} tests since they cover code already
covered by the remaining ones. Test persist_l2arc_006 is renamed to
persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005.
Fix a typo in persist_l2arc_004, and remove an assertion that is not
always true from l2arc_arcstats_pos. Also update an assertion in
persist_l2arc_005 and explain why in a comment.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12365
These were mostly used to annotate do {} while(0)s
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12201