This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes
As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```
As root:
```
# zfs list
NAME ZONED MOUNTPOINT
containers off /containers
containers/host off /containers/host
containers/host/child off /containers/host/child
containers/host/child/gchild off /containers/host/child/gchild
containers/unpriv on /unpriv
containers/unpriv/child on /unpriv/child
containers/unpriv/child/gchild on /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```
Back to the user namespace:
```
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
containers 129M 47.8G 24K /containers
containers/unpriv 128M 47.8G 24K /unpriv
containers/unpriv/child 128M 47.8G 128M /unpriv/child
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Buddy <https://buddy.works>
Closes#12263
Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver. blk-mq creates multiple request queues on
different CPUs rather than having a single request queue. This can
improve zvol performance with multithreaded reads/writes.
This implementation uses the blk-mq interfaces on 4.13 or newer
kernels. Building against older kernels will fall back to the
older BIO interfaces.
Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API. It is disabled by default.
In addition, this commit lets the zvol blk-mq layer process whole
`struct request` IOs at a time, rather than breaking them down
into their individual BIOs. This reduces dbuf lock contention
and overhead versus the legacy zvol submit_bio() codepath.
sequential dd to one zvol, 8k volblocksize, no O_DIRECT:
legacy submit_bio() 292MB/s write 453MB/s read
this commit 453MB/s write 885MB/s read
It also introduces a new `zvol_blk_mq_chunks_per_thread` module
parameter. This parameter represents how many volblocksize'd chunks
to process per each zvol thread. It can be used to tune your zvols
for better read vs write performance (higher values favor write,
lower favor read).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#13148
Issue #12483
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes#10058Closes#12918
Commit 63b18e4 fixed an issue in zpl_aio_write() to make sure that
kiocb->ki_pos was updated correctly when opening a file with O_APPEND.
Adding a test to verify O_APPEND functionality with lseek can make
sure that all other distros/kernel versions also have the correct
behavior.
Also moved the threadappends_001_pos test into this append test
directory in functional ZTS directory. This way the two append tests
are together for organization purposes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#13424
Only down to tests/zfs-tests/tests, but pull out C programs into the
main Makefile ‒ this means we get correct dependency tracking for all
programs (and parallelise across them)
dist diff:
-zfs-2.1.99/tests/zfs-tests/tests/stress/
-zfs-2.1.99/tests/zfs-tests/tests/stress/Makefile.am
-zfs-2.1.99/tests/zfs-tests/tests/stress/Makefile.in
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13316
No installation diff, dist lost
-zfs-2.1.99/cmd/fsck_zfs/fsck.zfs
which was distributed erroneously, since it's generated
Also clean gitrev on clean
Also add -e 'any possible bashisms' to default checkbashisms flags,
and fully parallelise it and shellcheck, and it works out-of-tree, too
Also align the Release in the dist META file correctly
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13316
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.
This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts
and even completes. But when it's about to check if the PG_writeback
bit has been cleared, another writeback with WB_SYNC_NONE starts.
The sync page writeback ends up waiting for the non-sync page
writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a
WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
waiting for the WB_SYNC_NONE writeback.
The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes#12662Closes#12790
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.
Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13329Closes#13338
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13274
As found by
git -C tests/ grep ^function | grep -vFe '.lua:' -e '.zcp:' | while IFS=":$IFS" read -r _ _ fn _; do [ $(git -C tests/ grep -wF $fn | head -2 | wc -l) -eq 1 ] && echo $fn; done
after all rounds this comes out to, sorted:
check_slog_state
chgusr_exec
cksum_files
cleanup_pools
compare_modes
count_ACE
dataset_set_defaultproperties
ds_is_snapshot
get_ACE
get_group
get_min
get_mode
get_owner
get_rand_checksum
get_rand_checksum_any
get_rand_large_recsize
get_rand_recsize
get_user_group
getitem
indirect_vdev_mapping_size
is_dilos
log_noresult
log_notinuse
log_other
log_timed_out
log_uninitiated
log_warning
num_jobs_by_cpu
plus_sign_check_l
plus_sign_check_v
record_cksum
rwx_node
seconds_mmp_waits_for_activity
set_cur_usr
setup_mirrors
setup_raidzs
showshares_smb
zfs_zones_setup
This, of course, doesn't catch recursive ones, or ones that log with
their own function name as a prefix, but
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13259
Otherwise, they leak past the tests and contaminate the running system,
breaking coredumps entirely
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Yannick Le Pennec <yannick.lepennec@live.fr>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13259
The regular default_raidz_setup function in the ZFS test suite called
log_pass after creating the zpool. However, with compression now being
on by default 56fa4aa, there is no way to turn compression off in the
setup.ksh scripts when creating a raidz VDEV. The addition of the
function default_raidz_setup_noexit allows for a raidz VDEV to be
created, additional zfs property settings to be applied and for the
setup.ksh script itself to call log_pass.
With the addition of default_raidz_setup_noexit some stray log_pass
calls were removed from any setup.ksh scripts that call
default_raidz_setup.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#13173
ZFS allows to update and retrieve additional file level attributes for
FreeBSD. This commit allows additional file level attributes to be
updated and retrieved for Linux. These include the flags stored in the
upper half of z_pflags only.
Two new IOCTLs have been added for this purpose. ZFS_IOC_GETDOSFLAGS
can be used to retrieve the attributes, while ZFS_IOC_SETDOSFLAGS can
be used to update the attributes.
Attributes that are allowed to be updated include ZFS_IMMUTABLE,
ZFS_APPENDONLY, ZFS_NOUNLINK, ZFS_ARCHIVE, ZFS_NODUMP, ZFS_SYSTEM,
ZFS_HIDDEN, ZFS_READONLY, ZFS_REPARSE, ZFS_OFFLINE and ZFS_SPARSE.
Flags can be or'd together while calling ZFS_IOC_SETDOSFLAGS.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#13118
A simple change, but so many tests break with it,
and those are the majority of this.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13078
While "diff -r" is the most straightforward way of comparing directory
trees for differences, it has two major issues:
* File metadata is not compared, which means that subtle bugs may be
missed even if a test is written that exercises the buggy behaviour.
* diff(1) doesn't know how to compare special files -- it assumes they
are always different, which means that a test using diff(1) on
special files will always fail (resulting in such tests not being
added).
rsync can be used in a very similar manner to diff (with the -ni flags),
but has the additional benefit of being able to detect and resolve many
more differences between directory trees. In addition, rsync has a
standard set of features and flags while diffs feature set depends on
whether you're using GNU or BSD binutils.
Note that for several of the test cases we expect that file timestamps
will not match. For example, the ctime for a file creation or modify
event is stored in the intent log but not the mtime. Thus when replaying
the log the correct ctime is set but the current mtime is used. This is
the expected behavior, so to prevent these tests from failing, there's a
replay_directory_diff function which ignores those kinds of changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12588
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.
For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.
This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
zil_process_commit_list()
if lwb == NULL // first zil_commit since zil_open
zil_create()
if no log block pointer in ZIL header:
if feature enabled and not active:
// CASE 1
enable, COALESCE txg wait with dmu_tx that allocated the
log block
else // log block was allocated earlier than this zil_open
if feature enabled and not active:
// CASE 2
enable, EXPLICIT txg wait
else // already have an in-DRAM LWB
if feature enabled and not active:
// this happens when we enable the feature after zil_create
// CASE 3
enable, EXPLICIT txg wait
[1] da6c28aaf6
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes#8768Closes#9078
ZFS on Linux originally implemented xattr namespaces in a way that is
incompatible with other operating systems. On illumos, xattrs do not
have namespaces. Every xattr name is visible. FreeBSD has two
universally defined namespaces: EXTATTR_NAMESPACE_USER and
EXTATTR_NAMESPACE_SYSTEM. The system namespace is used for protected
FreeBSD-specific attributes such as MAC labels and pnfs state. These
attributes have the namespace string "freebsd:system:" prefixed to the
name in the encoding scheme used by ZFS. The user namespace is used
for general purpose user attributes and obeys normal access control
mechanisms. These attributes have no namespace string prefixed, so
xattrs written on illumos are accessible in the user namespace on
FreeBSD, and xattrs written to the user namespace on FreeBSD are
accessible by the same name on illumos.
Linux has several xattr namespaces. On Linux, ZFS encodes the
namespace in the xattr name for every namespace, including the user
namespace. As a consequence, an xattr in the user namespace with the
name "foo" is stored by ZFS with the name "user.foo" and therefore
appears on FreeBSD and illumos to have the name "user.foo" rather than
"foo". Conversely, none of the xattrs written on FreeBSD or illumos
are accessible on Linux unless the name happens to be prefixed with one
of the Linux xattr namespaces, in which case the namespace is stripped
from the name. This makes xattrs entirely incompatible between Linux
and other platforms.
We want to make the encoding of user namespace xattrs compatible across
platforms. A critical requirement of this compatibility is for xattrs
from existing pools from FreeBSD and illumos to be accessible by the
same names in the user namespace on Linux. It is also necessary that
existing pools with xattrs written by Linux retain access to those
xattrs by the same names on Linux. Making user namespace xattrs from
Linux accessible by the correct names on other platforms is important.
The handling of other namespaces is not required to be consistent.
Add a fallback mechanism for listing and getting xattrs to treat xattrs
as being in the user namespace if they do not match a known prefix.
Do not allow setting or getting xattrs with a name that is prefixed
with one of the namespace names used by ZFS on supported platforms.
Allow choosing between legacy illumos and FreeBSD compatibility and
legacy Linux compatibility with a new tunable. This facilitates
replication and migration of pools between hosts with different
compatibility needs.
The tunable controls whether or not to prefix the namespace to the
name. If the xattr is already present with the alternate prefix,
remove it so only the new version persists. By default the platform's
existing convention is used.
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11919
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches
which results in passing `-fsanitize=address`
and `-fsanitize=undefined`, respectively, to the compiler. Those
flags are enabled in GitHub workflows for ZTS and zloop. Errors
reported by both instrumentations are corrected, except for:
- Memory leak reporting is (temporarily) suppressed. The cost of
fixing them is relatively high compared to the gains.
- Checksum computing functions in `module/zcommon/zfs_fletcher*`
have UBSan errors suppressed. It is completely impractical
to enforce 64-byte payload alignment there due to performance
impact.
- There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom
memory allocator is used there rendering that measure
unfeasible.
- Memory leaks detection has to be suppressed for `cmd/zvol_id`.
`zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is
incompatible with memory leaks detection.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#12928
This commit adds enumerated names to disambiguate between the
different vdevs. Previously only 'zpool status' showed enumerated
vdev names, now 'zpool list -v' and 'zpool iostat -v' also shows
the enumerated vdev names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#12510Closes#13031
Deprecation of Python versions below 3.6 gives opportunity to unify the
build and install requirements for OpenZFS packages. The minimal
supported Python version is 3.6 as this is the most recent Python
package CentOS/RHEL 7 users can get.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#12925
- Replaces use of manual `zpool sync`
- Don't use `log_must sync_pool` as `sync_pool` uses it internally
- Replace many (but not all) uses of `sync` with `sync_pool`
This makes the tests more consistent, and makes searching easier.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#12894
Provide access to file generation number on Linux.
Add test coverage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#12856
In case if all label checksums will be invalid on any vdev, the pool
will become unimportable. From other side zdb with -l option will not
provide any useful information why it happened. Add notifications
about corrupted label checksums.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#2509Closes#12685
When using lseek(2) to report data/holes memory mapped regions of
the file were ignored. This could result in incorrect results.
To handle this zfs_holey_common() was updated to asynchronously
writeback any dirty mmap(2) regions prior to reporting holes.
Additionally, while not strictly required, the dn_struct_rwlock is
now held over the dirty check to prevent the dnode structure from
changing. This ensures that a clean dnode can't be dirtied before
the data/hole is located. The range lock is now also taken to
ensure the call cannot race with zfs_write().
Furthermore, the code was refactored to provide a dnode_is_dirty()
helper function which checks the dnode for any dirty records to
determine its dirtiness.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900Closes#12724
The zvol_misc tests, in particular zvol_misc_volmode, make use of a
common udev_wait function to wait for zvol devices in /dev to quiesce
on Linux. On other platforms this function currently only sleeps for
one second before returning. This is insufficient, and
zvol_misc_volmode has been flaky on FreeBSD as a result.
Replace udev_wait with block_device_wait, passing through the optional
device parameter where possible. Rearrange a few checks to strengthen
the verifications we are making and avoid unnecessarily sleeping. We
must keep udev_wait in a couple places to pass in Github CI workflows.
Remove zvol_misc_volmode from the maybe failing tests on FreeBSD in
zts-report.py.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#12583
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes#12458
The ZTS block_device_wait helper function should use -e when waiting
for a file to appear since it will be either a block special device
or a symlink. This didn't cause any failures but when a device path
was specified the function would wait longer than needed.
Additionally update the most flakey test cases to pass the file path
to block_device_wait to try and improve the test reliability. The
udev behavior on Fedora in particular can result in frequent false
positives.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12515
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#12432
- Bail out early if we're running the perf tests and forget to
specify disks.
- Allow perf tests to run with any number of disks.
- Remove weekly vs. nightly settings
- Move variables with common values to perf.shlib
- Use zinject to clear the ARC over export/import
- Fix dbuf cache size calculation
When the meaning of `dbuf_cache_max_bytes` changed, the performance
test that covers the dbuf cache started to fail. The test would try to
write files for the test using the max possible size of the cache,
inevitably filling the pool and failing. This change uses
`dbuf_cache_shift` to correctly calculate the dbuf cache size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#12408
zstreamdump was replaced with "zstream dump"; let's stop using the
old name, compat symlink or no.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12277
verify_slog_support no longer applies to ZFS since slog support
is always available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#12092
It's been observed in the CI that the required 25% of obsolete bytes
in the mapping can be to high a threshold for this test resulting in
condensing never being triggered and a test failure. To prevent these
failures make the existing zfs_condense_indirect_obsolete_pct tuning
available so the obsolete percentage can be reduced from 25% to 5%
during this test.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11869
200ms time-out is relatively long, but if we already hit the cap,
then we'll likely be able to spawn multiple new jobs when we wake up
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#11807
The current user_run often does not work as expected. Commands are run
in a different shell, with a different environment, and all output is
discarded.
Simplify user_run to retain the current environment, eliminate eval,
and feed the command string into ksh. Enhance the logging for
user_run so we can see out and err.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11185
A few deadman tunables ended up in the wrong sysctl node.
Move them to vfs.zfs.deadman.*
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11715
The behavior of a NULL fromsnap was inadvertently changed for a doall
send when the send/recv logic in libzfs was updated. Restore the
previous behavior by correcting send_iterate_snap() to include all
the snapshots in the nvlist for this case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Cedric Maunoury <cedric.maunoury@gmail.com>
Closes#11608
Property to allow sets of features to be specified; for compatibility
with specific versions / releases / external systems. Influences
the behavior of 'zpool upgrade' and 'zpool create'. Initial man
page changes and test cases included.
Brief synopsis:
zpool create -o compatibility=off|legacy|file[,file...] pool vdev...
compatibility = off : disable compatibility mode (enable all features)
compatibility = legacy : request that no features be enabled
compatibility = file[,file...] : read features from specified files.
Only features present in *all* files will be enabled on the
resulting pool. Filenames may be absolute, or relative to
/etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc
checked first).
Only affects zpool create, zpool upgrade and zpool status.
ABI changes in libzfs:
* New function "zpool_load_compat" to load and parse compat sets.
* Add "zpool_compat_status_t" typedef for compatibility parse status.
* Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum
* Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum
An initial set of base compatibility sets are included in
cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is
modified to install these in $pkgdatadir/compatibility.d and to
create symbolic links to a reasonable set of aliases.
Reviewed-by: ericloewe
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes#11468
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10102
Convert dynamic allocation to static buffer, simplify parse_dataset
function return path. Add tests specific to the mount helper.
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sterling Jensen <sterlingjensen@users.noreply.github.com>
Closes#11098
The second part of list_file_blocks transforms the object description
output by zdb -ddddd $ds $objnum into a stream of lines of the form
"level path offset length" for the indirect blocks in the given file.
The current code only works for the first copy of L0 blocks. L1 and
L2 indirect blocks have more than one copy on disk.
Add one more -d to the zdb command so we get all block copies and
rewrite the transformation to match more than L0 and output all DVAs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11141
The first part of list_file_blocks transforms the pool configuration
output by zdb -C $pool into shell code to set up a shell variable,
VDEV_MAP, that maps from vdev id to the underlying vdev path. This
variable is a simple indexed array. However, the vdev id in a DVA is
only the id of the top level vdev.
When the pool is mirrored, the top level vdev is a mirror and its
children are the mirrored devices. So, what we need is to map from
the top level vdev id to a list of the underlying vdev paths.
ist_file_blocks does not need to work for raidz vdevs, so we can
disregard that case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11141
A zpool_influxdb command is introduced to ease the collection
of zpool statistics into the InfluxDB time-series database.
Examples are given on how to integrate with the telegraf
statistics aggregator, a companion to influxdb.
Finally, a grafana dashboard template is included to show
how pool latency distributions can be visualized in a
ZFS + telegraf + influxdb + grafana environment.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes#10786
With procfs_list kstats implemented for FreeBSD, dbufs are now exposed
as kstat.zfs.misc.dbufs.
On FreeBSD, dbufstats can use the sysctl instead of procfs when no
input file has been given.
Enable the dbufstats tests on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#11008
Instead of relying on arbitrary timers after pool export/import or cache
device off/online rely on arcstats. This makes the L2ARC tests more
robust. Also cleanup some functions related to persistent L2ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10983
When an invalid incremental send is requested where the "to" ds is
before the "from" ds, make sure to drop the reference to the pool
and the dataset before returning the error.
Add an assert on FreeBSD to make sure we don't hold any locks after
returning from an ioctl.
Add some test coverage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10919
== Motivation and Context
The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.
In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.
== Description
This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes#10932
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10743
Duplicate io and checksum ereport events can misrepresent that
things are worse than they seem. Ideally the zpool events and the
corresponding vdev stat error counts in a zpool status should be
for unique errors -- not the same error being counted over and over.
This can be demonstrated in a simple example. With a single bad
block in a datafile and just 5 reads of the file we end up with a
degraded vdev, even though there is only one unique error in the pool.
The proposed solution to the above issue, is to eliminate duplicates
when posting events and when updating vdev error stats. We now save
recent error events of interest when posting events so that we can
easily check for duplicates when posting an error.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#10861
Use ZFS_MODULE_PARAM for cross-platform tunables in spa_stats.c, and
add update tunables.cfg in tests for the newly supported ones.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10858
FreeBSD doesn't have an equivalent to udevadm settle, so we have been
resorting to a three second sleep to wait for device changes to take
effect. This is far from ideal.
We are mainly waiting for volmode=geom zvols to appear in /dev, so as
a hack, reading the geom config will have the desired effect of
quiescing the geom state.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10768
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#6247Closes#9024Closes#10277Closes#10278
Bring zfs-tests.sh in to compliance with the other scripts
by converting it /bin/sh for to avoid a dependency on bash.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10640
zfs_jail was not using zfs_ioctl so failed to map the IOC number
correctly. Use zfs_ioctl to perform the jail ioctl and add a test
case for FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10658
The `zfs program` subcommand invokes a LUA interpreter to run ZFS
"channel programs". This interpreter runs in a constrained environment,
with defined memory limits. The LUA stack (used for LUA functions that
call each other) is allocated in the kernel's heap, and is limited by
the `-m MEMORY-LIMIT` flag and the `zfs_lua_max_memlimit` module
parameter. The C stack is used by certain LUA features that are
implemented in C. The C stack is limited by `LUAI_MAXCCALLS=20`, which
limits call depth.
Some LUA C calls use more stack space than others, and `gsub()` uses an
unusually large amount. With a programming trick, it can be invoked
recursively using the C stack (rather than the LUA stack). This
overflows the 16KB Linux kernel stack after about 11 iterations, less
than the limit of 20.
One solution would be to decrease `LUAI_MAXCCALLS`. This could be made
to work, but it has a few drawbacks:
1. The existing test suite does not pass with `LUAI_MAXCCALLS=10`.
2. There may be other LUA functions that use a lot of stack space, and
the stack space may change depending on compiler version and options.
This commit addresses the problem by adding a new limit on the amount of
free space (in bytes) remaining on the C stack while running the LUA
interpreter: `LUAI_MINCSTACK=4096`. If there is less than this amount
of stack space remaining, a LUA runtime error is generated.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10611Closes#10613
FreeBSD stat uses -f to specify the format string rather than -c.
list_file_blocks in blkdev.shlib uses stat -c %i to get a file's
object ID for zdb. We already have a library function to do this
portably.
Use get_objnum to get the file's object ID.
Take log_must off of the call to list_free_blocks in
corrupt_blocks_at_level, which had masked the error. It was not good
to pipe the output of log_must into the while-loop, anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10572
A bunch of places need to edit files to incorporate the configured paths
i.e. bindir, sbindir etc. Move this logic into a common file.
Create arc_summary by copying arc_summary[23] as appropriate at build
time instead of install time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10559
== Motivation and Context
The current implementation of 'sharenfs' and 'sharesmb' relies on
the use of the sharetab file. The use of this file is os-specific
and not required by linux or freebsd. Currently the code must
maintain updates to this file which adds complexity and presents
a significant performance impact when sharing many datasets. In
addition, concurrently running 'zfs sharenfs' command results in
missing entries in the sharetab file leading to unexpected failures.
== Description
This change removes the sharetab logic from the linux and freebsd
implementation of 'sharenfs' and 'sharesmb'. It still preserves an
os-specific library which contains the logic required for sharing
NFS or SMB. The following entry points exist in the vastly simplified
libshare library:
- sa_enable_share -- shares a dataset but may not commit the change
- sa_disable_share -- unshares a dataset but may not commit the change
- sa_is_shared -- determine if a dataset is shared
- sa_commit_share -- notify NFS/SMB subsystem to commit the shares
- sa_validate_shareopts -- determine if sharing options are valid
The sa_commit_share entry point is provided as a performance enhancement
and is not required. The sa_enable_share/sa_disable_share may commit
the share as part of the implementation. Libshare provides a framework
for both NFS and SMB but some operating systems may not fully support
these protocols or all features of the protocol.
NFS Operation:
For linux, libshare updates /etc/exports.d/zfs.exports to add
and remove shares and then commits the changes by invoking
'exportfs -r'. This file, is automatically read by the kernel NFS
implementation which makes for better integration with the NFS systemd
service. For FreeBSD, libshare updates /etc/zfs/exports to add and
remove shares and then commits the changes by sending a SIGHUP to
mountd.
SMB Operation:
For linux, libshare adds and removes files in /var/lib/samba/usershares
by calling the 'net' command directly. There is no need to commit the
changes. FreeBSD does not support SMB.
== Performance Results
To test sharing performance we created a pool with an increasing number
of datasets and invoked various zfs actions that would enable and
disable sharing. The performance testing was limited to NFS sharing.
The following tests were performed on an 8 vCPU system with 128GB and
a pool comprised of 4 50GB SSDs:
Scale testing:
- Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
- Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &
Functional testing:
- share each filesystem serially -- zfs share -a
- unshare each filesystem serially -- zfs unshare -a
- reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>
For 'zfs sharenfs=on' scale testing we saw an average reduction in time
of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time
of 83.36%.
Functional testing also shows a huge improvement:
- zfs share -- 97.97% reduction in time
- zfs unshare -- 96.47% reduction in time
- zfs inhert -r sharenfs -- 99.01% reduction in time
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Bryant G. Ly <bryangly@gmail.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-68690
Closes#1603Closes#7692Closes#7943Closes#10300
FreeBSD recently replaced the GNU bc and dc in the base system with
BSD licensed versions. They are supposed to be compatible with all
the features present in the GNU versions, but it turns out they are
picky about `if` statements having a corresponding `else`. ZTS uses
`echo "if ($x > $y) 1" | bc` in a few places, which causes tests to
fail unexpectedly with the new bc.
Change the two expressions in ZTS to `if ($x > $y) 1 else 0` for
compatibility with the new BSD bc.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10551
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices. However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10349
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.
Add some files that were missing altogether in various Makefile's.
The changes to disted files in this commit (excluding deleted files):
+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10501
Implements a pam module for automatically loading zfs encryption keys
for home datasets. The pam module:
- loads a zfs key and mounts the dataset when a session opens.
- unmounts the dataset and unloads the key when the session closes.
- when the user is logged on and changes the password, the module
changes the encryption key.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: @jengelh <jengelh@inai.de>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Felix Dörre <felix@dogcraft.de>
Closes#9886Closes#9903
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.
Add the handler so the tunable can be declared in common code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10490
The horrible effects of human slavery continue to impact society. The
casual use of the term "slave" in computer software is an unnecessary
reference to a painful human experience.
This commit removes all possible references to the term "slave".
Implementation notes:
The zpool.d/slaves script is renamed to dm-deps, which uses the same
terminology as `dmsetup deps`.
References to the `/sys/class/block/$dev/slaves` directory remain. This
directory name is determined by the Linux kernel. Although
`dmsetup deps` provides the same information, it unfortunately requires
elevated privileges, whereas the `/sys/...` directory is world-readable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10435
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.
We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.
We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).
We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9713Closes#9789Closes#10224
Small program that converts a dataset id and an object id to a path
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10204
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.
In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.
A test has been added to verify that resilver does not restart when
relevant DTL's are available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#10291
The verify_pool function should detect checksum errors on any vdev, but
it was only checking at the root of the pool.
Accumulate the errors for all vdevs to obtain the correct count.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10271
Remove some obsolete legacy compat, rename some misnamed, and add some
missing tunables for FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10203
Add the FreeBSD platform code to the OpenZFS repository. As of this
commit the source can be compiled and tested on FreeBSD 11 and 12.
Subsequent commits are now required to compile on FreeBSD and Linux.
Additionally, they must pass the ZFS Test Suite on FreeBSD which is
being run by the CI. As of this commit 1230 tests pass on FreeBSD
and there are no unexpected failures.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#898Closes#8987
Deduplicated send and receive is deprecated. To ease migration to the
new dedup-send-less world, the commit adds a `zstream redup` utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.
The new `zstream` command also replaces the functionality of
`zstreamdump`, by way of the `zstream dump` subcommand. The
`zstreamdump` command is replaced by a shell script which invokes
`zstream dump`.
The way that `zstream redup` works under the hood is that as we read the
send stream, we build up a hash table which maps from `<GUID, object,
offset> -> <file_offset>`.
Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is `drr_toguid, drr_object, drr_offset`.)
For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).
For WRITE_BYREF records, we change them to WRITE records. We find the
referenced WRITE record by looking in the hash table (for the record
with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading
the record header and payload from the specified offset in the stream
file. This is why the stream can not be a pipe. The found WRITE record
replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`,
and `drr_offset` fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).
This algorithm requires memory proportional to the number of WRITE
records (same as `zfs send -D`), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream
with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to
"redup".
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10124Closes#10156
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#925Closes#1823Closes#2672Closes#3744Closes#9582
Increasing l2arc_write_size or l2arc_write_boost can result in
l2arc_write_buffers() not having enough space to perform its writes and
panic zio_write_phys().
Instead of resetting l2ad_hand to l2ad_start at the end of
l2arc_write_buffers() and not taking into account a possible
user-mediated increase of l2arc_write_max, we do this in l2arc_evict(),
right after l2arc_write_size() has run. If there is not enough space to
evict (ie we will exceed l2ad_end) we evict to the end of the device,
reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate
l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making
sure in l2arc_write_size() that l2ad_start + size does not exceed
l2ad_end.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10154
Don't echo the results of arithmetic expressions, it's not necessary.
Use hw.clockrate sysctl to get CPU freq instead of parsing dmesg.boot
for a line that might not even be there anymore.
Reduce bookkeeping in fill_fs, making it easier to follow.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10113
* Set geom debug flags in corrupt_blocks_at_level
* Use the right time zone for history tests
* Add missing commands.cfg entry for diskinfo
* Rewrite get_last_txg_synced to use zdb
* Don't check ulimits for sparse files
* Suspend removal before removing a vdev, not after
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10054
* Check for mountd in is_shared to avoid timeout when not running
* Enhance robustness of some cleanup functions
* Simplify atime lookup
* Skip sharenfs validation for now
* Don't add mountpoint property to inheritance validation on FreeBSD
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10047
zed_start may be called in places where zed is not
typically already running, but this is not a requirement
of the tests.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#9974
FreeBSD doesn't have a `share` command. It does have showmount.
Split the separate platform impls out of is_shared_impl.
Dispatch to the correct platform impl function from is_shared.
Eliminate the use of is_shared_impl from tests. is_shared works.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10037