Commit Graph

2656 Commits

Author SHA1 Message Date
Tony Hutter
11bdc5c8e8 Revert "ztest fails assertion in zio_write_gang_member_ready()"
This reverts commit 0156253d29.

That commit was identified as causing IO errors on a user's
encrypted dataset:
https://github.com/openzfs/zfs/issues/14413

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
2023-01-24 15:35:24 -08:00
Richard Yao
7319a73921 Linux ppc64le ieee128 compat: Do not redefine __asm on external headers
There is an external assembly declaration extension in GNU C that glibc
uses when building with ieee128 floating point support on ppc64le.
Marking that as volatile makes no sense, so the build breaks.

It does not make sense to only mark this as volatile on Linux, since if
do not want the compiler reordering things on Linux, we do not want the
compiler reordering things on any other platform, so we stop treating
Linux specially and just manually inline the CPP macro so that we can
eliminate it. This should fix the build on ppc64le.

Tested-by: @gyakovlev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14308
Closes #14384
2023-01-19 12:50:42 -08:00
George Amanakis
f806306ce0 Activate filesystem features only in syncing context
When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14304
Closes #14252
2023-01-19 12:50:42 -08:00
Richard Yao
a2aabac123 Zero end of embedded block buffer in dump_write_embedded()
This fixes a kernel stack leak.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Nicholas Sherlock <n.sherlock@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13778
Closes #14255
2023-01-19 12:50:36 -08:00
Richard Yao
24a6d8316a Fix dereference after null check in enqueue_range
If the bp is NULL, we have a hole. However, when we build with
assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When
this happens on a hole, we will have a NULL pointer dereference.

Reported-by: Coverity (CID-1524670)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14264
2023-01-19 12:50:36 -08:00
Matthew Ahrens
0156253d29 ztest fails assertion in zio_write_gang_member_ready()
Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved
for the salt+IV.  However, dmu_write_policy() allows non-encrypted
blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and
allocate 3 DVA's, since they don't need a salt+IV (they are merely
authenicated).

However, if such a block becomes a gang block, the gang code incorrectly
limits the gang block header to 2 DVA's.  This leads to a "NDVAs
inversion", where a parent block (the gang block header) has less DVA's
than its children (the gang members), causing an assertion failure in
zio_write_gang_member_ready().

This commit addresses the problem by only restricting the gang block
header to 2 DVA's if the block is actually encrypted (and thus its gang
block members can have at most 2 DVA's).

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14250
Closes #14356
2023-01-10 08:44:55 -08:00
Ameer Hamza
2f2d6bece8 zed: unclean disk attachment faults the vdev
If the attached disk already contains a vdev GUID, it
means the disk is not clean. In such a scenario, the
physical path would be a match that makes the disk
faulted when trying to online it. So, we would only
want to proceed if either GUID matches with the last
attached disk or the disk is in a clean state.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
2023-01-05 11:09:36 -08:00
szubersk
4767037bcf vdev_raidz_math_aarch64_neonx2.c: suppress diagnostic only for GCC
Signed-off-by: szubersk <szuberskidamian@gmail.com>
2022-12-09 12:07:38 -08:00
George Amanakis
c8d2ab05e1 Fix setting the large_block feature after receiving a snapshot
We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #13699
Closes #13782
2022-12-01 12:39:45 -08:00
Richard Yao
e48aaef89f Fix NULL pointer dereference in dbuf_prefetch_indirect_done()
When ZFS is built with assertions, a prefetch is done on a redacted
blkptr and `dpa->dpa_dnode` is NULL, we will have a NULL pointer
dereference in `dbuf_prefetch_indirect_done()`.

Both Coverity and Clang's Static Analyzer caught this.

Reported-by: Coverity (CID 1524671)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14210
2022-12-01 12:39:44 -08:00
наб
32f7499acf module: zfs: vdev_removal: remove unused num_indirect
Found with -Wunused-but-set-variable on Clang trunk

Fixes: a1d477c24c ("OpenZFS 7614, 9064 - zfs device evacuation/removal")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13304
2022-12-01 12:39:44 -08:00
Rich Ercolani
fa7d572a8a Handle and detect #13709's unlock regression (#14161)
In #13709, as in #11294 before it, it turns out that 63a26454 still had
the same failure mode as when it was first landed as d1d47691, and
fails to unlock certain datasets that formerly worked.

Rather than reverting it again, let's add handling to just throw out
the accounting metadata that failed to unlock when that happens, as
well as a test with a pre-broken pool image to ensure that we never get
bitten by this again.

Fixes: #13709

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2022-12-01 12:39:43 -08:00
shodanshok
d9de079a4b Fix arc_p aggressive increase
The original ARC paper called for an initial 50/50 MRU/MFU split
and this is accounted in various places where arc_p = arc_c >> 1,
with further adjustment based on ghost lists size/hit. However, in
current code both arc_adapt() and arc_get_data_impl() aggressively
grow arc_p until arc_c is reached, causing unneeded pressure on
MFU and greatly reducing its scan-resistance until ghost list
adjustments kick in.

This patch restores the original behavior of initially having arc_p
as 1/2 of total ARC, without preventing MRU to use up to 100% total
ARC when MFU is empty.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #14137
Closes #14120
2022-12-01 12:39:43 -08:00
Brooks Davis
5f53a444b3 Remove an unused variable
Clang-16 detects this set-but-unused variable which is assigned and
incremented, but never referenced otherwise.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14125
2022-12-01 12:39:43 -08:00
Richard Yao
256b74d0b0 Address warnings about possible division by zero from clangsa
* The complaint in ztest_replay_write() is only possible if something
   went horribly wrong. An assertion will silence this and if it goes
   off, we will know that something is wrong.
 * The complaint in spa_estimate_metaslabs_to_flush() is not impossible,
   but seems very unlikely. We resolve this by passing the value from
   the `MIN()` that does not go to infinity when the variable is zero.

There was a third report from Clang's scan-build, but that was a
definite false positive and disappeared when checked again through
Clang's static analyzer with Z3 refution via CodeChecker.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14124
2022-12-01 12:39:43 -08:00
Allan Jude
ac01b876c9 Avoid null pointer dereference in dsl_fs_ss_limit_check()
Check for cr == NULL before dereferencing it in
dsl_enforce_ds_ss_limits() to lookup the zone/jail ID.

Reported-by: Coverity (CID 1210459)
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #14103
2022-12-01 12:39:43 -08:00
Richard Yao
e9a8fb17b5 Fix too few arguments to formatting function
CodeQL reported that when the VERIFY3U condition is false, we do not
pass enough arguments to `spl_panic()`. This is because the format
string from `snprintf()` was concatenated into the format string for
`spl_panic()`, which causes us to have an unexpected format specifier.

A CodeQL developer suggested fixing the macro to have a `%s` format
string that takes a stringified RIGHT argument, which would fix this.
However, upon inspection, the VERIFY3U check was never necessary in the
first place, so we remove it in favor of just calling `snprintf()`.

Lastly, it is interesting that every other static analyzer run on the
codebase did not catch this, including some that made an effort to catch
such things. Presumably, all of them relied on header annotations, which
we have not yet done on `spl_panic()`. CodeQL apparently is able to
track the flow of arguments on their way to annotated functions, which
llowed it to catch this when others did not. A future patch that I have
in development should annotate `spl_panic()`, so the others will catch
this too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098
2022-12-01 12:39:43 -08:00
Richard Yao
3830858c5c Fix memory leaks in dmu_send()/dmu_send_obj()
If we encounter an EXDEV error when using the redacted snapshots
feature, the memory used by dspp.fromredactsnaps is leaked.

Clang's static analyzer caught this during an experiment in which I had
annotated various headers in an attempt to improve the results of static
analysis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13973
2022-12-01 12:39:42 -08:00
Richard Yao
409c99a1d3 Fix NULL pointer dereference in spa_open_common()
Calling spa_open() will pass a NULL pointer to spa_open_common()'s
config parameter. Under the right circumstances, we will dereference the
config parameter without doing a NULL check.

Clang's static analyzer found this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044
2022-12-01 12:39:42 -08:00
Richard Yao
bbec0e60a8 Fix NULL pointer passed to strlcpy from zap_lookup_impl()
Clang's static analyzer pointed out that whenever zap_lookup_by_dnode()
is called, we have the following stack where strlcpy() is passed a NULL
pointer for realname from zap_lookup_by_dnode():

strlcpy()
zap_lookup_impl()
zap_lookup_norm_by_dnode()
zap_lookup_by_dnode()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044
2022-12-01 12:39:42 -08:00
Richard Yao
a5f17a94d3 fm_fmri_hc_create() must call va_end() before returning
clang-tidy caught this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044
2022-12-01 12:39:42 -08:00
Richard Yao
c562bbefc0 Fix potential NULL pointer dereference in dsl_dataset_promote_check()
If the `list_head()` returns NULL, we dereference it, right before we
check to see if it returned NULL.

We have defined two different pointers that both point to the same
thing, which are `origin_head` and `origin_ds`. Almost everything uses
`origin_ds`, so we switch them to use `origin_ds`.

We also promote `origin_ds` to a const pointer so that the compiler
verifies that nothing modifies it.

Coverity complained about this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13967
2022-12-01 12:39:41 -08:00
Richard Yao
b247d47be1 Cleanup: Make memory barrier definitions consistent across kernels
We inherited membar_consumer() and membar_producer() from OpenSolaris,
but we had replaced membar_consumer() with Linux's smp_rmb() in
zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the
Linux-only smp_rmb().

We reinstate membar_consumer() in platform independent code and fix the
FreeBSD SPL to implement membar_consumer() in a way analogous to Linux.

Reviewed-by: Konstantin Belousov <kib@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13843
2022-12-01 12:39:40 -08:00
Damian Szuberski
0f4ee295ba dsl_prop_known_index(): check for invalid prop
Resolve UBSAN array-index-out-of-bounds error in zprop_desc_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #14142
Closes #14147
2022-11-08 10:16:21 -08:00
Ameer Hamza
ca3a675c74 zed: Prevent special vdev to be replaced by hot spare
Special vdevs should not be replaced by a hot spare.
Log vdevs already support this, extending the
functionality for special vdevs.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14129
2022-11-07 13:36:57 -08:00
Attila Fülöp
cd1f023846
Deny receiving into encrypted datasets if the keys are not loaded (#14139)
Commit 68ddc06b61 introduced support
for receiving unencrypted datasets as children of encrypted ones but
unfortunately got the logic upside down. This resulted in failing to
deny receives of incremental sends into encrypted datasets without
their keys loaded. If receiving a filesystem, the receive was done
into a newly created unencrypted child dataset of the target. In
case of volumes the receive made the target volume undeletable since
a dataset was created below it, which we obviously can't handle.
Incremental streams with embedded blocks are affected as well.

We fix the broken logic to properly deny receives in such cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #13598
Closes #14055
Closes #14119
2022-11-04 11:07:29 -07:00
Ryan Moeller
b27c7a1457 zil: Relax assertion in zil_parse
Rather than panic debug builds when we fail to parse a whole ZIL, let's
instead improve the logging of errors and continue like in a release
build.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #14116
2022-11-01 12:49:14 -07:00
Mariusz Zaborski
186e39f336 quota: extend quota for dataset
This patch relax the quota limitation for dataset by around 3%.
What this means is that user can write more data then the quota is
set to. However thanks to that we can get more stable bandwidth, in
case when we are overwriting data in-place, and not consuming any
additional space.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org>
Sponsored-by: Zededa Inc.
Sponsored-by: Klara Inc.
Closes #13839
2022-11-01 12:48:37 -07:00
shodanshok
1d2b0563f7 Fix ARC target collapse when zfs_arc_meta_limit_percent=100
Reclaim metadata when arc_available_memory < 0 even if
meta_used is not bigger than arc_meta_limit.

As described in https://github.com/openzfs/zfs/issues/14054 if
zfs_arc_meta_limit_percent=100 then ARC target can collapse to
arc_min due to arc_purge not freeing any metadata.

This patch lets arc_prune to do its work when arc_available_memory
is negative even if meta_used is not bigger than arc_meta_limit,
avoiding ARC target collapse.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #14054 
Closes #14093
2022-11-01 12:48:30 -07:00
vaclavskala
8929355b4c Propagate extent_bytes change to autotrim thread
The autotrim thread only reads zfs_trim_extent_bytes_min and
zfs_trim_extent_bytes_max variable only on thread start.  We
should check for parameter changes during thread execution to
allow parameter changes take effect without needing to disable
then restart the autotrim.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Václav Skála <skala@vshosting.cz>
Closes #14077
2022-11-01 12:48:23 -07:00
Serapheim Dimitropoulos
37d5a3e04b Stop ganging due to past vdev write errors
= Problem

While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Under some further
analysis we discovered that the pool would create gang blocks because
all its disks had non-zero write error counts and they'd be skipped
for normal metaslab allocations due to the following if-clause in
`metaslab_alloc_dva()`:
```
	/*
	 * Avoid writing single-copy data to a failing,
	 * non-redundant vdev, unless we've already tried all
	 * other vdevs.
	 */
	if ((vd->vdev_stat.vs_write_errors > 0 ||
	    vd->vdev_state < VDEV_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->vdev_children == 0) {
		metaslab_trace_add(zal, mg, NULL, psize, d,
		    TRACE_VDEV_ERROR, allocator);
		goto next;
	}
```

= Proposed Solution

Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state so the past write
errors doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).

= Testing

I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME        SIZE  ALLOC   FREE
volpool    22.5G   117M  22.4G
  xvdb     7.99G  40.2M  7.46G
  xvdc     7.99G  39.1M  7.46G
  xvdd     7.99G  37.8M  7.46G
```

And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```

And got the vdevs to the following state:
```
$ zpool status volpool
  pool: volpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:

	NAME        STATE     READ WRITE CKSUM
	volpool     ONLINE       0     0     0
	  xvdb      ONLINE       0     1     0
	  xvdc      ONLINE       0     1     0
	  xvdd      ONLINE       0     4     0

```

I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0  # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```

Then I checked that I the problem was reproduced in my VM as I the
gang count was growing in zdb as I was writting more data:
```
$ sudo zdb volpool | grep gang
        ganged count:              1384

$ sudo zdb volpool | grep gang
        ganged count:              1393

$ sudo zdb volpool | grep gang
        ganged count:              1402

$ sudo zdb volpool | grep gang
        ganged count:              1414
```

Then I updated my bits with this patch and the gang count stayed the
same.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003
2022-11-01 12:36:25 -07:00
Serapheim Dimitropoulos
37763ea2a6 Fix panic in dsl_process_sub_livelist for EINTR
= Issue

Recently we hit an assertion panic in `dsl_process_sub_livelist` while
exporting the spa and interrupting `bpobj_iterate_nofree`. In that case
`bpobj_iterate_nofree` stops mid-way returning an EINTR without clearing
the intermediate AVL tree that keeps track of the livelist entries it
has encountered so far. At that point the code has a VERIFY for the
number of elements of the AVL expecting it to be zero (which is not the
case for EINTR).

= Fix

Cleanup any intermediate state before destroying the AVL when
encountering EINTR. Also added a comment documenting the scenario where
the EINTR comes up. There is no need to do anything else for the calles
of `dsl_process_sub_livelist` as they already handle the EINTR case.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #13939
2022-11-01 12:34:08 -07:00
Mateusz Guzik
c8d6a91a99 Bring per_txg_dirty_frees_percent back to 30
The current value causes significant artificial slowdown during mass
parallel file removal, which can be observed both on FreeBSD and Linux
when running real workloads.

Sample results from Linux doing make -j 96 clean after an allyesconfig
modules build:

before: 4.14s user 6.79s system 48% cpu 22.631 total
after:	4.17s user 6.44s system 153% cpu 6.927 total

FreeBSD results in the ticket.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by:	Mateusz Guzik <mjguzik@gmail.com>
Closes #13932
Closes #13938
2022-11-01 12:32:40 -07:00
Akash B
7ac732b8d6 Add options to zfs redundant_metadata property
Currently, additional/extra copies are created for metadata in
addition to the redundancy provided by the pool(mirror/raidz/draid),
due to this 2 times more space is utilized per inode and this decreases
the total number of inodes that can be created in the filesystem. By
setting redundant_metadata to none, no additional copies of metadata
are created, hence can reduce the space consumed by the additional
metadata copies and increase the total number of inodes that can be
created in the filesystem.  Additionally, this can improve file create
performance due to the reduced amount of metadata which needs
to be written.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #13680
2022-11-01 12:25:58 -07:00
samwyc
fc1c0053f9 Fix sequential resilver drive failure race condition
This patch handles the race condition on simultaneous failure of
2 drives, which misses the vdev_rebuild_reset_wanted signal in
vdev_rebuild_thread. We retry to catch this inside the
vdev_rebuild_complete_sync function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Samuel Wycliffe J <samwyc@hpe.com>
Closes #14041
Closes #14050
2022-10-21 14:05:06 -07:00
Alexander Motin
33223cbc3c Refactor Log Size Limit
Original Log Size Limit implementation blocked all writes in case of
limit reached until the TXG is committed and the log is freed.  It
caused huge delays and following speed spikes in application writes.

This implementation instead smoothly throttles writes, using exactly
the same mechanism as used for dirty data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: jxdking <lostking2008@hotmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Issue #12284
Closes #13476
2022-09-26 14:55:27 -07:00
Brian Behlendorf
91e02156dd Revert "Reduce dbuf_find() lock contention"
This reverts commit 34dbc618f5.  While this
change resolved the lock contention observed for certain workloads, it
inadventantly reduced the maximum hash inserts/removes per second.  This
appears to be due to the slightly higher acquisition cost of a rwlock vs
a mutex.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2022-09-21 13:15:51 -07:00
Richard Yao
b66f8d3c2b Add zfs_btree_verify_intensity kernel module parameter
I see a few issues in the issue tracker that might be aided by being
able to turn this on. We have no module parameter for it, so I would
like to add one.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13874
2022-09-21 13:15:51 -07:00
Richard Yao
5096ed31c8 Fix incorrect size given to bqueue_enqueue() call in dmu_redact.c
We pass sizeof (struct redact_record *) rather than sizeof (struct
redact_record). Passing the pointer size is wrong.

Coverity caught this in two places.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13885
2022-09-21 13:15:51 -07:00
Ameer Hamza
035e52f591 Delay ZFS_PROP_SHARESMB property to handle it for encrypted raw receive
For encrypted raw receive, objset creation is delayed until a call to
dmu_recv_stream(). ZFS_PROP_SHARESMB property requires objset to be
populated when calling zpl_earlier_version(). To correctly handle the
ZFS_PROP_SHARESMB property for encrypted raw receive, this change
delays setting the property.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13878
2022-09-21 13:15:26 -07:00
Alexander Motin
44cec45f72 Improve too large physical ashift handling
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.

When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency.  Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.

Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.

Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".

There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified.  Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #13798
2022-09-21 13:15:15 -07:00
Kevin Jin
d05f3039f7 Add Module Parameter Regarding Log Size Limit
zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached,
write operation is blocked, until log data is cleared out
after txg sync. It only counts TX_WRITE log with WR_COPIED
or WR_NEED_COPY.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #12284
2022-09-21 16:12:14 -07:00
Kevin Jin
999830a021 Optimize txg_kick() process (#12274)
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().

Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #12274
2022-09-21 16:12:14 -07:00
Ameer Hamza
a5b0d42540 zfs recv hangs if max recordsize is less than received recordsize
- Some optimizations for bqueue enqueue/dequeue.
- Added a fix to prevent deadlock when both bqueue_enqueue_impl()
and bqueue_dequeue() waits for signal to be triggered.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13855
2022-09-19 09:39:07 -07:00
Richard Yao
3f7c174b50 vdev_draid_lookup_map() should not iterate outside draid_maps
Coverity reported this as an out-of-bounds read.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13865
2022-09-15 16:58:35 -07:00
Akash B
03fa3ef264 Add physical device size to SIZE column in 'zpool list -v'
Add physical device size/capacity only for physical devices in
'zpool list -v' instead of displaying "-" in the SIZE column.
This would make it easier to see the individual device capacity and
to determine which spares are large enough to replace which devices.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #12561
Closes #13106
2022-09-15 10:23:01 -07:00
George Amanakis
8bd3dca9bf Introduce a tunable to exclude special class buffers from L2ARC
Special allocation class or dedup vdevs may have roughly the same
performance as L2ARC vdevs. Introduce a new tunable to exclude those
buffers from being cacheable on L2ARC.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11761
Closes #12285
2022-09-14 11:27:00 -07:00
Alexander Motin
b6ebf270eb Apply arc_shrink_shift to ARC above arc_c_min
It makes sense to free memory in smaller chunks when approaching
arc_c_min to let other kernel subsystems to free more, since after
that point we can't free anything.  This also matches behavior on
Linux, where to shrinker reported only the size above arc_c_min.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13794
2022-09-13 17:59:10 -07:00
Richard Yao
8131a96544 Fix use-after-free in btree code
Coverty static analysis found these.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #10989
Closes #13861
2022-09-13 16:15:38 -07:00
Rich Ercolani
035ee628cf Revert behavior of 59eab109 on not-Linux
It turns out that short-circuiting the EFAULT behavior on a short read
breaks things on FreeBSD. So until there's a nicer solution, let's
just revert the behavior for not-Linux.

Reference:
https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12698
2022-08-02 10:05:14 -07:00