Deduplicated send and receive is deprecated. To ease migration to the
new dedup-send-less world, the commit adds a `zstream redup` utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.
The new `zstream` command also replaces the functionality of
`zstreamdump`, by way of the `zstream dump` subcommand. The
`zstreamdump` command is replaced by a shell script which invokes
`zstream dump`.
The way that `zstream redup` works under the hood is that as we read the
send stream, we build up a hash table which maps from `<GUID, object,
offset> -> <file_offset>`.
Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is `drr_toguid, drr_object, drr_offset`.)
For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).
For WRITE_BYREF records, we change them to WRITE records. We find the
referenced WRITE record by looking in the hash table (for the record
with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading
the record header and payload from the specified offset in the stream
file. This is why the stream can not be a pipe. The found WRITE record
replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`,
and `drr_offset` fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).
This algorithm requires memory proportional to the number of WRITE
records (same as `zfs send -D`), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for `zfs send -D`). A 1TB send stream
with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to
"redup".
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10124Closes#10156
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#925Closes#1823Closes#2672Closes#3744Closes#9582
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.
Add warning messages when tunables are being ignored.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10157Closes#10158
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10178
Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified
the blk_alloc_queue() interface by updating it to take the request
queue as an argument. Add a wrapper function which accepts the new
arguments and internally uses the available interfaces.
Other minor changes include increasing the Linux-Maximum to 5.6 now
that 5.6 has been released. It was not bumped to 5.7 because this
release has not yet been finalized and is still subject to change.
Added local 'struct zvol_state_os *zso' variable to zvol_alloc.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10181Closes#10187
Added to prevent a possible deadlock, the following comments from
FreeBSD explain the issue. The comment describing vn_io_fault_uiomove:
/*
* Helper function to perform the requested uiomove operation using
* the held pages for io->uio_iov[0].iov_base buffer instead of
* copyin/copyout. Access to the pages with uiomove_fromphys()
* instead of iov_base prevents page faults that could occur due to
* pmap_collect() invalidating the mapping created by
* vm_fault_quick_hold_pages(), or pageout daemon, page laundry or
* object cleanup revoking the write access from page mappings.
*
* Filesystems specified MNTK_NO_IOPF shall use vn_io_fault_uiomove()
* instead of plain uiomove().
*/
This used for vn_io_fault which has the following motivation:
/*
* The vn_io_fault() is a wrapper around vn_read() and vn_write() to
* prevent the following deadlock:
*
* Assume that the thread A reads from the vnode vp1 into userspace
* buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
* currently not resident, then system ends up with the call chain
* vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
* vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
* which establishes lock order vp1->vn_lock, then vp2->vn_lock.
* If, at the same time, thread B reads from vnode vp2 into buffer buf2
* backed by the pages of vnode vp1, and some page in buf2 is not
* resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
*
* To prevent the lock order reversal and deadlock, vn_io_fault() does
* not allow page faults to happen during VOP_READ() or VOP_WRITE().
* Instead, it first tries to do the whole range i/o with pagefaults
* disabled. If all pages in the i/o buffer are resident and mapped,
* VOP will succeed (ignoring the genuine filesystem errors).
* Otherwise, we get back EFAULT, and vn_io_fault() falls back to do
* i/o in chunks, with all pages in the chunk prefaulted and held
* using vm_fault_quick_hold_pages().
*
* Filesystems using this deadlock avoidance scheme should use the
* array of the held pages from uio, saved in the curthread->td_ma,
* instead of doing uiomove(). A helper function
* vn_io_fault_uiomove() converts uiomove request into
* uiomove_fromphys() over td_ma array.
*
* Since vnode locks do not cover the whole i/o anymore, rangelocks
* make the current i/o request atomic with respect to other i/os and
* truncations.
*/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10177
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.
To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.
With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10179
Commit 379ca9c removed the check on aux devices to be block devices also
changing zfs_ioctl(hdl, ZFS_IOC_VDEV_ADD, ...) and
zfs_ioctl(hdl, ZFS_IOC_POOL_CREATE, ...) to never set ENOTBLK. This
change removes the dangling check for ENOTBLK that will never trigger.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alex John <alex@stty.io>
Closes#10173
The delegate tests use `date(1)` to generate snapshot names, using
the format '%F-%T-%N' to get nanosecond resolution (since multiple
snapshots may be taken in the same second). '%N' is not portable, and
causes tests to fail on FreeBSD.
Since the only purpose these timestamps serve is to create a unique
name, simply use $RANDOM instead.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10170
Add a mechanism to wait for delete queue to drain.
When doing redacted send/recv, many workflows involve deleting files
that contain sensitive data. Because of the way zfs handles file
deletions, snapshots taken quickly after a rm operation can sometimes
still contain the file in question, especially if the file is very
large. This can result in issues for redacted send/recv users who
expect the deleted files to be redacted in the send streams, and not
appear in their clones.
This change duplicates much of the zpool wait related logic into a
zfs wait command, which can be used to wait until the internal
deleteq has been drained. Additional wait activities may be added
in the future.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9707
In zfs_write(), the loop continues to the next iteration without
accounting for partial copies occurring in uiomove_iov when
copy_from_user/__copy_from_user_inatomic return a non-zero status.
This results in "zfs: accessing past end of object..." in the
kernel log, and the write failing.
Account for partial copies and update uio struct before returning
EFAULT, leave a comment explaining the reason why this is done.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: ilbsmart <wgqimut@gmail.com>
Signed-off-by: Fabio Scaccabarozzi <fsvm88@gmail.com>
Closes#8673Closes#10148
== Summary ==
Prior to this change, sync writes to a zvol are processed serially.
This commit makes zvols process concurrently outstanding sync writes in
parallel, similar to how reads and async writes are already handled.
The result is that the throughput of sync writes is tripled.
== Background ==
When a write comes in for a zvol (e.g. over iscsi), it is processed by
calling `zvol_request()` to initiate the operation. ZFS is expected to
later call `BIO_END_IO()` when the operation completes (possibly from a
different thread). There are a limited number of threads that are
available to call `zvol_request()` - one one per iscsi client (unless
using MC/S). Therefore, to ensure good performance, the latency of
`zvol_request()` is important, so that many i/o operations to the zvol
can be processed concurrently. In other words, if the client has
multiple outstanding requests to the zvol, the zvol should have multiple
outstanding requests to the storage hardware (i.e. issue multiple
concurrent `zio_t`'s).
For reads, and async writes (i.e. writes which can be acknowledged
before the data reaches stable storage), `zvol_request()` achieves low
latency by dispatching the bulk of the work (including waiting for i/o
to disk) to a taskq. The taskq callback (`zvol_read()` or
`zvol_write()`) blocks while waiting for the i/o to disk to complete.
The `zvol_taskq` has 32 threads (by default), so we can have up to 32
concurrent i/os to disk in service of requests to zvols.
However, for sync writes (i.e. writes which must be persisted to stable
storage before they can be acknowledged, by calling `zil_commit()`),
`zvol_request()` does not use `zvol_taskq`. Instead it blocks while
waiting for the ZIL write to disk to complete. This has the effect of
serializing sync writes to each zvol. In other words, each zvol will
only process one sync write at a time, waiting for it to be written to
the ZIL before accepting the next request.
The same issue applies to FLUSH operations, for which `zvol_request()`
calls `zil_commit()` directly.
== Description of change ==
This commit changes `zvol_request()` to use
`taskq_dispatch_ent(zvol_taskq)` for sync writes, and FLUSh operations.
Therefore we can have up to 32 threads (the taskq threads)
simultaneously calling `zil_commit()`, for a theoretical performance
improvement of up to 32x.
To avoid the locking issue described in the comment (which this commit
removes), we acquire the rangelock from the taskq callback (e.g.
`zvol_write()`) rather than from `zvol_request()`. This applies to all
writes (sync and async), reads, and discard operations. This means that
multiple simultaneously-outstanding i/o's which access the same block
can complete in any order. This was previously thought to be incorrect,
but a review of the block device interface requirements revealed that
this is fine - the order is inherently not defined. The shorter hold
time of the rangelock should also have a slight performance improvement.
For an additional slight performance improvement, we use
`taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a
`kmem_alloc()` and eliminates a failure mode. This applies to all
writes (sync and async), reads, and discard operations.
== Performance results ==
We used a zvol as an iscsi target (server) for a Windows initiator
(client), with a single connection (the default - i.e. not MC/S).
We used `diskspd` to generate a workload with 4 threads, doing 1MB
writes to random offsets in the zvol. Without this change we get
231MB/s, and with the change we get 728MB/s, which is 3.15x the original
performance.
We ran a real-world workload, restoring a MSSQL database, and saw
throughput 2.5x the original.
We saw more modest performance wins (typically 1.5x-2x) when using MC/S
with 4 connections, and with different number of client threads (1, 8,
32).
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10163
Increasing l2arc_write_size or l2arc_write_boost can result in
l2arc_write_buffers() not having enough space to perform its writes and
panic zio_write_phys().
Instead of resetting l2ad_hand to l2ad_start at the end of
l2arc_write_buffers() and not taking into account a possible
user-mediated increase of l2arc_write_max, we do this in l2arc_evict(),
right after l2arc_write_size() has run. If there is not enough space to
evict (ie we will exceed l2ad_end) we evict to the end of the device,
reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate
l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making
sure in l2arc_write_size() that l2ad_start + size does not exceed
l2ad_end.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10154
udev is only used on Linux.
Skip udev_wait and udev_cleanup when not on Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10165
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator. Other platforms
do not require the same logic.
Implement an arc_default_max() function to determine a default max ARC
size in platform code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10155
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10152
And in removal tests, sync the specific pool we are waiting on.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10146
These paths are never exercised, as the parameters given are always
different cipher and plaintext `crypto_data_t` pointers.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fueloep <attila@fueloep.org>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes#9661Closes#10015
Issue #10090 reported that snapshots created between midnight and 1 AM
are missing a padded zero in the creation property
This change fixes the bug reported in issue #10090 where snapshots
created between midnight and 1 AM were missing a padded zero in the
creation timestamp output.
The leading zero was missing because the time format string used `%k`
which formats the hour as a decimal number from 0 to 23 where single
digits are preceded by blanks[0] and is fixed by changing it to `%H`
which formats the hour as 00-23.
The difference in output is as below
```
-Thu Mar 26 0:39 2020
+Thu Mar 26 00:39 2020
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alex John <alex@stty.io>
Closes#10090Closes#10153
Dedup send can only deduplicate over the set of blocks in the send
command being invoked, and it does not take advantage of the dedup table
to do so. This is a very common misconception among not only users, but
developers, and makes the feature seem more useful than it is. As a
result, many users are using the feature but not getting any benefit
from it.
Dedup send requires a nontrivial expenditure of memory and CPU to
operate, especially if the dataset(s) being sent is (are) not already
using a dedup-strength checksum.
Dedup send adds developer burden. It expands the test matrix when
developing new features, causing bugs in released code, and delaying
development efforts by forcing more testing to be done.
As a result, we are deprecating the use of `zfs send -D` and receiving
of such streams. This change adds a warning to the man page, and also
prints the warning whenever dedup send or receive are used.
In a future release, we plan to:
1. remove the kernel code for generating deduplicated streams
2. make `zfs send -D` generate regular, non-deduplicated streams
3. remove the kernel code for receiving deduplicated streams
4. make `zfs receive` of deduplicated streams process them in userland
to "re-duplicate" them, so that they can still be received.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7887Closes#10117
libzfs aborts and dumps core on EINVAL from the kernel when trying to
do a redacted send with a bookmark that is not a redaction bookmark.
Move redacted bookmark validation into libzfs.
Check if the bookmark given for redactions is actually a redaction
bookmark. Print an error message and exit gracefully if it is not.
Don't abort on EINVAL in zfs_send_one.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10138
Changed interval value type from decimal to integer,
because of deprecation warning in Python 3.8 and above.
Also changed kstat values type from decimal to integer,
because all the values are integers.
Fixed behavior of arcstat when run without args.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Bartosz Zieba <bartosz@zieba.pro>
Closes#10132Closes#10142
If a has rollback has occurred while a file is open and unlinked.
Then when the file is closed post rollback it will not exist in the
rolled back version of the unlinked object. Therefore, the call to
zap_remove_int() may correctly return ENOENT and should be allowed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6812Closes#9739
Fix minor cstyle warnings accidentally introduced by 7145123b.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10143
This change adds a separate return code to zfs_ioc_recv that is used
for incomplete streams, in addition to the existing return code for
streams that contain corruption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10122
There are a couple of x86_64 architectures which support all needed
features to make the accelerated GCM implementation work but the
MOVBE instruction. Those are mainly Intel Sandy- and Ivy-Bridge
and AMD Bulldozer, Piledriver, and Steamroller.
By using MOVBE only if available and replacing it with a MOV
followed by a BSWAP if not, those architectures now benefit from
the new GCM routines and performance is considerably better
compared to the original implementation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Followup #9749Closes#10029
Currently when the dataset is in use we can't receive snapshots.
zfs send test/1@asd | zfs recv -FM test/2
cannot unmount '/test/2': Device busy
This commits add option 'M' which attempts to forcibly unmount the
dataset. Thanks to this we can enforce receiving snapshots in a
single step.
Note that this functionality is not supported on Linux because the
VFS will prevent active mounted filesystems from being unmounted,
even with the force option. This is the intended VFS behavior.
Test cases were added to verify the expected behavior based on
the platform.
Discussed-with: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
External-issue: https://reviews.freebsd.org/D22306Closes#9904
And add log_pass appropriately.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10136
While #10121 did fix the signal numbers for FreeBSD/Darwin, it
incorrectly changed the expected encoding of exit status for commands
that exited on a signal. The encoding 256+signum is a feature of the
shell. Only the signal numbers themselves are platform-dependent.
Always use the encoding 256+signum when checking exit status for
signal exits.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10137
UINT64_MAX is not exactly representable as a double.
The closest representation is UINT64_MAX + 1, so we can use a >=
comparison instead of > for the bounds check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10127
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object. If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant. For example, in some
workloads the `receieve_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).
To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all. By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).
This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10099
The default options are reasonable for all of the CI builders.
* TEST_XFSTESTS_SKIP=yes - This is already the default.
* TEST_ZTEST_TIMEOUT=3600 - Increased ztest run time only increases
code coverage by a small degree. Default 900s runs are sufficient.
* Disabling certain tests on 32-bit builders is no longer needed.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10129
Some tests which pass on FreeBSD but fail on Linux had been put in the
"maybe" set. Move these back to "known" under an "if Linux" check so
the expected outcome is clear.
Add some tests that have been found to be flaky on FreeBSD stable/12
to the "maybe" set.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10120
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9662Closes#10115
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.
Addition of special and just in case dedup classes fixes the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#10106Closes#8694
Different operating systems encode exit status in different ways.
The logapi shell library assumes the Solaris meaning of exit codes,
which is not correct on other platforms.
Define the needed constants according to the platform we are running
on and use those to decode process exit status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10121
Issue #9142 describes an error in the checks for device removal that
can prevent removal of special allocation class vdevs in some
situations.
Enhance alloc_class/alloc_class_012_pos to check situations where this
bug occurs.
Update zts-report with knowledge of issue #9142.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10116
Issue #9142
Cleanup for write_dirs involves destroying a dataset filling a pool
and then recreating the dataset for the next test. Due to the
asynchronous nature of free space accounting, recreating the dataset
can fail for lack of space, causing problems for the next test.
Add wait_freeing $TESTPOOL to wait for the space to be freed and then
sync_pool $TESTPOOL to update the space accounting before attempting
to recreate the test filesystem.
Only use a single disk to create the pool. Make it a small file so it
does not take too long to fill.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10112
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:
refs = zfs_refcount_remove(&dn->dn_holds, tag);
mutex_exit(&dn->dn_mtx);
So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.
This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#7814Closes#10101
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:
__schedule at ffffffff81610f2e
schedule at ffffffff81611558
schedule_preempt_disabled at ffffffff8161184a
__mutex_lock at ffffffff816131e8
arc_buf_destroy at ffffffffa0bf37d7 [zfs]
dbuf_destroy at ffffffffa0bfa6fe [zfs]
dbuf_evict_one at ffffffffa0bfaa96 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
osd_object_delete at ffffffffa0b64ecc [osd_zfs]
lu_object_free at ffffffffa06d6a74 [obdclass]
lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
shrink_slab at ffffffff811ca9d8
shrink_node at ffffffff811cfd94
do_try_to_free_pages at ffffffff811cfe63
try_to_free_pages at ffffffff811d01c4
__alloc_pages_slowpath at ffffffff811be7f2
__alloc_pages_nodemask at ffffffff811bf3ed
new_slab at ffffffff81226304
___slab_alloc at ffffffff812272ab
__slab_alloc at ffffffff8122740c
kmem_cache_alloc at ffffffff81227578
spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
arc_read at ffffffffa0bf0924 [zfs]
dbuf_read at ffffffffa0bf9083 [zfs]
dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes#9987
Commit 54007c79 introduced an error, changing the final
argument to $ZDB from ztest to $ZTEST. This argument
indicates the pool name, not the script, and so should
not have been changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#10118
Tests that get killed do not have an opportunity to clean up.
There are many bad states this can leave the system in, but of
particular gravity is when zinject has been used to induce bad
behavior for one or more of the test disks.
Create a failsafe mechanism in test-runner.py that runs a callback
script after every test. The script is common to all tests so all
tests benefit from the protection.
Add an obligatory `zinject -c all` to clear all zinject state after
every test case is run.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10096
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10067
Don't echo the results of arithmetic expressions, it's not necessary.
Use hw.clockrate sysctl to get CPU freq instead of parsing dmesg.boot
for a line that might not even be there anymore.
Reduce bookkeeping in fill_fs, making it easier to follow.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10113
This fixes a bug where the generated zfs-functions was being included
along with original zfs-functions.in in the make dist tarball. This
caused an unfortunate series of events during build/packaging that
resulted in the RPM-installed /etc/zfs/zfs-functions listing the
paths as:
ZFS="/usr/local/sbin/zfs"
ZED="/usr/local/sbin/zed"
ZPOOL="/usr/local/sbin/zpool"
When they should have been:
ZFS="/sbin/zfs"
ZED="/sbin/zed"
ZPOOL="/sbin/zpool"
This affects init.d (non-systemd) distros like CentOS 6.
/etc/default/zfs and /etc/zfs/zfs-functions are also used by the
initramfs, so they need to be built even when init.d support is not.
They have been moved to the (new) etc/default and (existing) etc/zfs
source directories, respectively.
Fixes: #9443
Co-authored-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Previously the generated keyload units for encryption roots with
keylocation=file://* didn't contain the code to detect if the key
was already loaded and would be marked failed in such situations.
Move the code to check whether the key is already loaded
from keylocation=prompt handling to general key loading code.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes#10103