Commit Graph

695 Commits

Author SHA1 Message Date
Allan Jude
cf2667759f
zfs userspace: use zfs_path_to_zhandle so argument can be a path
Change zfs userspace subcommand to use zfs_path_to_zhandle() so that
the provided dataset can be a path (/usr) or a dataset (rpool/usr).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #8915
2020-09-25 14:37:10 -07:00
George Amanakis
37b00fb039
Update documentation of l2arc_mfuonly
with regard to evicted_l2_eligibile_mru. Even if l2arc_mfuonly is
enabled, this is not reflected in evicted_l2_eligible_mru as this
information is useful for deciding whether to toggle l2arc_mfuonly
depending on the current workload.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10945
2020-09-21 09:26:24 -07:00
George Wilson
c494aa7f57
vdev_ashift should only be set once
== Motivation and Context

The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.

In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.

== Description

This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes #10932
2020-09-18 12:13:47 -07:00
Ryan Moeller
7ead2be3d2
Rename acltype=posixacl to acltype=posix
Prefer acltype=off|posix, retaining the old names as aliases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10918
2020-09-16 12:26:06 -07:00
Georgy Yakovlev
9cc177baa0
cmd/zgenhostid: replace with simple c implementation
It was discovered that dracut scripts and zgenhostid
always generate little-endian /etc/hostid.

This commit provides simple endianess-aware binary
and updates the scripts to use it.

New features include:
 -f flag to force overwrite.
 -o flag to write to different file (for dracut)
 accepting both 0x01234567 and 01234567 values as input

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Closes #10887 
Closes #10925
2020-09-16 12:25:12 -07:00
наб
5797d7cc4c
zpoolprops.8: fix raidz par[i]ty typo
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #10923
2020-09-15 15:43:42 -07:00
George Amanakis
085321621e
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.

This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.

L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.

When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.

Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.

Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
  an ARC buffer. However, prefetches may be issued in a way that
  arc_read_done() is bypassed. Instead, move the related code in
  l2arc_write_eligible() to account for those cases too.

Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 10:10:44 -07:00
George Amanakis
feb3a7eef1
Introduce ZFS module parameter l2arc_mfuonly
In certain workloads it may be beneficial to reduce wear of L2ARC
devices by not caching MRU metadata and data into L2ARC. This commit
introduces a new tunable l2arc_mfuonly for this purpose.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10710
2020-09-08 11:44:37 -07:00
Ryan Moeller
3ec88ab205 man: Cross-reference zfs-load-key(8) for ENCRYPTION mention
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Submitted-by: Harry Schmalzbauer
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10866
2020-09-04 10:53:59 -07:00
Ryan Moeller
782b1c1249 man: Add zfs rename -r to zfs-rename(8) SYNOPSIS
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10866
2020-09-04 10:53:22 -07:00
Don Brady
4f07282786
Avoid posting duplicate zpool events
Duplicate io and checksum ereport events can misrepresent that 
things are worse than they seem. Ideally the zpool events and the 
corresponding vdev stat error counts in a zpool status should be 
for unique errors -- not the same error being counted over and over. 
This can be demonstrated in a simple example. With a single bad 
block in a datafile and just 5 reads of the file we end up with a 
degraded vdev, even though there is only one unique error in the pool.

The proposed solution to the above issue, is to eliminate duplicates 
when posting events and when updating vdev error stats. We now save 
recent error events of interest when posting events so that we can 
easily check for duplicates when posting an error. 

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #10861
2020-09-04 10:34:28 -07:00
Ryan Moeller
7b4e27232d
Add 'zfs rename -u' to rename without remounting
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.

This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.

Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10839
2020-09-01 16:14:16 -07:00
John-Mark Gurney
770269ef3a
Add the Xr's to the SEE ALSO as well
There are a ton of zfs-* and zpool-* man pages. This adds them to 
the SEE ALSO section so that people can more quickly look through 
what all the options are, now that the pages have been split.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: John-Mark Gurney <jmg@funkthat.com>
Closes #10589
2020-08-26 22:29:00 -07:00
Alexander Motin
523e1295fe
Introduce limit on size of L2ARC headers
Since L2ARC buffers are not evicted on memory pressure, too large
amount of headers on system with irrationally large L2ARC can render
it slow or even unusable.  This change limits L2ARC writes and
rebuild if unevictable L2ARC-only headers reach dangerous level.

While there, call arc_adapt() on L2ARC rebuild, so that it could
properly grow arc_c, reflecting potentially significant ARC size
increase and avoiding slow growth with hopeless eviction attempts
later when "overflow" is detected.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #10765
2020-08-25 14:33:36 -07:00
Ryan Moeller
b596585fd9
man: Canonicalize .TH usage
* Use all caps for document title.
* Remove section name as it can be inferred from the section number.
* Name "OpenZFS" as the document source.
* Bump modification date.

While here, fixed trailing whitespace reported by igor.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10792
2020-08-24 21:25:28 -07:00
Chris McDonough
07ce8961e5
Improve documentation of zpool import -d/-c vs -s
Specify that, by default, zpool import uses the libblkid
cache on Linux and geom on FreeBSD, and only scans when
-d/-s is provided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Chris McDonough <chrism@plope.com>
Closes #7656
Closes #10771
2020-08-23 21:18:30 -07:00
Ryan Moeller
6fe3498ca3
Import vdev ashift optimization from FreeBSD
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

 1. Existing pools created with devices that have different logical
    and physical block sizes, but were configured to use the logical
    block size (e.g. because the OS version used for pool construction
    reported the logical block size instead of the physical block
    size) will suddenly find that the vdev allocation size has
    increased. This can be easily tolerated for active members of
    the array, but ZFS would prevent replacement of a vdev with
    another identical device because it now appears that the smaller
    allocation size required by the pool is not supported by the new
    device.

 2. The device's physical block size may be too large to be supported
    by ZFS. The optimal allocation size for the vdev may be quite
    large. For example, a RAID controller may export a vdev that
    requires read-modify-write cycles unless accessed using 64k
    aligned/sized requests. ZFS currently has an 8k minimum block
    size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Macy <mmacy@freebsd.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10619
2020-08-21 12:53:17 -07:00
Ryan Moeller
6706552ea6
Remove hard coded "Linux" OS from manpages
The recommended practice for `.Os` on FreeBSD is to not specify any
arguments.  The correct OS name is used automatically.

Oddly enough, on the Linux distro I tested this on (CentOS 7), the man
pager defaulted to displaying "BSD" as the OS rather than "Linux".  To
accommodate this, tack " Linux" back on in an install hook on Linux.
This is much simpler than removing it for FreeBSD when vendored in the
base system.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10760
2020-08-21 11:55:47 -07:00
Michael Niewöhner
10b3c7f5e4 Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:

- zstd: A basic ZStandard compression algorithm Available compression.
  Levels for zstd are zstd-1 through zstd-19, where the compression
  increases with every level, but speed decreases.

- zstd-fast: A faster version of the ZStandard compression algorithm
  zstd-fast is basically a "negative" level of zstd. The compression
  decreases with every level, but speed increases.

  Available compression levels for zstd-fast:
   - zstd-fast-1 through zstd-fast-10
   - zstd-fast-20 through zstd-fast-100 (in increments of 10)
   - zstd-fast-500 and zstd-fast-1000

For more information check the man page.

Implementation details:

Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.

The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers).  The upper bits are used to store the compression level.

It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.

All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables.  Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).

The userspace tools all use the combined/bit-shifted value.

Additional notes:

zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.

ZSTD is included with all current tests and new tests are added
as-needed.

Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born.  This is currently only used by zstd but can be
extended as needed.

Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-20 10:30:06 -07:00
Ryan Moeller
009cc8e884
Make zc_nvlist_src_size limit tunable
We limit the size of nvlists passed to the kernel so a user cannot make
the kernel do an unreasonably large allocation.  On FreeBSD this limit
was 128 kiB, which turns out to be a bit too small when doing some
operations involving a large number of datasets or snapshots, for
example replication.

Make this limit tunable, with a platform-specific auto default.
Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the
system limit on user wired memory, which allows it to scale depending
on system configuration.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Issue #6572 
Closes #10706
2020-08-18 09:33:55 -07:00
Richard Laager
eaa25f1a8e
Remove GRUB restrictions
The GRUB restrictions are based around the pool's bootfs property.
Given the current situation where GRUB is not staying current with
OpenZFS pool features, having either a non-ZFS /boot or a separate
pool with limited features are pretty much the only long-term answers
for GRUB support.  Only the second case matters in this context.  For
the restrictions to be useful, the bootfs property would have to be set
on the boot pool, because that is where we need the restrictions, as
that is the pool that GRUB reads from. The documentation for bootfs
describes it as pointing to the root pool. That's also how it's used in
the initramfs. ZFS does not allow setting bootfs to point to a dataset
in another pool. (If it did, it'd be difficult-to-impossible to enforce
these restrictions cross-pool). Accordingly, bootfs is pretty much
useless for GRUB scenarios moving forward.

Even for users who have only one pool, the existing restrictions for
GRUB are incomplete. They don't prevent you from enabling the
unsupported checksums, for example. For that reason, I have ripped out
all the GRUB restrictions.

A little longer-term, I think extending the proposed features=portable
system to define a features=grub is a much more useful approach. The
user could set that on the boot pool at creation, and things would
Just Work.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8627
2020-08-17 23:12:39 -07:00
Matthew Ahrens
994de7e4b7
Remove KMC_KMEM and KMC_VMEM
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.

KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM).  The flag, module parameter, /proc entries, and associated
code are removed.

KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc().  The flag, /proc entries, and associated code are removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10673
2020-08-17 16:04:28 -07:00
Matthew Ahrens
3442c2a02d
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages.  This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory.  The ARC shrinker callback is invoked from the
page-allocation code path.

2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released.  However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`).  Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`.  This has several
negative impacts:

1. ZFS doesn't use RAM to cache data effectively.

2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.

3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker.  This limits the amount of time it takes
to allocate a single page.

2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system.  Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).

3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing.  This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.

4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.

5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-07-31 21:10:52 -07:00
Ryan Moeller
8348fac30c
Limit dbuf cache sizes based only on ARC target size by default
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.

Recalculate the metadata cache size on demand so it can adapt, too.

Update descriptions in zfs-module-parameters(5).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10563 
Closes #10610
2020-07-24 20:38:48 -07:00
tony-zfs
02fced3067
Add support to decode a resume token
Adding a new subcommand to zstream called token. This
now allows users to decode a resume token to retrieve the toname
field. This can be useful for tools that need this information.
The syntax works as follows zstream token <resume_token>.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Tony Perkins <tperkins@datto.com>
Closes #10558
2020-07-23 17:44:03 -07:00
Matthew Ahrens
6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
Arvind Sankar
38e2e9ce83 Centralize variable substitution
A bunch of places need to edit files to incorporate the configured paths
i.e. bindir, sbindir etc. Move this logic into a common file.

Create arc_summary by copying arc_summary[23] as appropriate at build
time instead of install time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10559
2020-07-14 17:33:44 -07:00
Brian Behlendorf
9a49d3f3d3
Add device rebuild feature
The device_rebuild feature enables sequential reconstruction when
resilvering.  Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices.  However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.

The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.

    zpool attach -s <pool> <existing vdev> <new vdev>
    zpool replace -s <pool> <old vdev> <new vdev>

The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.

The `zpool wait -t resilver` command was extended to wait on
sequential resilvers.  From this perspective they are no different
than healing resilvers.

Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.

As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory.  Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.

Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10349
2020-07-03 11:05:50 -07:00
Matthew Ahrens
7b232e9354
arcstat: add 'avail', fix 'free'
The meaning of the `free` field is currently `zfs_arc_sys_free`, which
is the target amount of memory to leave free for the system, and is
constant after booting.

This commit changes the meaning of `free` to arc_free_memory(), the
amount of memory that the ARC considers to be free.

It also adds a new arcstat field `avail`, which tracks
`arc_available_memory()`.

Since `avail` can be negative, it also updates the arcstat script to
pretty-print negative values.

example output:

  $ arcstat -f time,miss,arcsz,c,grow,need,free,avail 1
      time  miss  arcsz     c  grow  need  free  avail
  15:03:02   39K   114G  114G     0     0  2.4G  407M
  15:03:03   42K   114G  114G     0     0  2.1G  120M
  15:03:04   40K   114G  114G     0     0  1.8G  -177M
  15:03:05   24K   113G  112G     0     0  1.7G  -269M
  15:03:06   29K   111G  110G     0     0  1.6G  -385M
  15:03:07   27K   110G  108G     0     0  1.4G  -535M
  15:03:08   13K   108G  108G     0     0  2.2G  239M
  15:03:09   33K   107G  107G     0     0  1.3G  -639M
  15:03:10   16K   105G  102G     0     0  2.6G  704M
  15:03:11  7.2K   102G  102G     0     0  5.1G  3.1G
  15:03:12   42K   103G  102G     0     0  4.8G  2.8G

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10494
2020-06-26 18:05:28 -07:00
Arvind Sankar
6b99fc0620 Fixes for make dist
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.

Add some files that were missing altogether in various Makefile's.

The changes to disted files in this commit (excluding deleted files):

+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10501
2020-06-26 14:20:02 -07:00
Jorgen Lundman
68301ba20e
zed additional features
This commit adds two features to zed, that macOS desires. The first
is that when you unload the kernel module, zed would enter into a
cpubusy loop calling zfs_events_next() repeatedly. We now look for
ENODEV, returned by kernel, so zed can exit gracefully.

Second feature is -I (idle) (alas -P persist was taken) is for the
deamon to;

1; if started without ZFS kernel module, stick around waiting for it.
2; if kernel module is unloaded, go back to 1.

This is due to daemons in macOS is started by launchctl, and is
expected to stick around.

Currently, the busy loop only exists when errno is ENODEV. This is
to ensure that functionality that upstream expects is not changed.
It did not care about errors before, and it still does not. (with the
exception of ENODEV).

However, it is probably better that all errors
(ERESTART notwithstanding) exits the loop, and the issues complaining
about zed taking all CPU will go away.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10476
2020-06-22 09:53:34 -07:00
adilger
f734301d22
linux: add basic fallocate(mode=0/2) compatibility
Implement semi-compatible functionality for mode=0 (preallocation)
and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL.

Since ZFS does COW and snapshots, preallocating blocks for a file
cannot guarantee that writes to the file will not run out of space.
Even if the first overwrite was guaranteed, it would not handle any
later overwrite of blocks due to COW, so strict compliance is futile.
Instead, make a best-effort check that at least enough free space is
currently available in the pool (with a bit of margin), then create
a sparse file of the requested size and continue on with life.

This does not handle all cases (e.g. several fallocate() calls before
writing into the files when the filesystem is nearly full), which
would require a more complex mechanism to be implemented, probably
based on a modified version of dmu_prealloc(), but is usable as-is.

A new module option zfs_fallocate_reserve_percent is used to control
the reserve margin for any single fallocate call.  By default, this
is 110% of the requested preallocation size, so an additional 10% of
available space is reserved for overhead to allow the application a
good chance of finishing the write when the fallocate() succeeds.
If the heuristics of this basic fallocate implementation are not
desirable, the old non-functional behavior of returning EOPNOTSUPP
for calls can be restored by setting zfs_fallocate_reserve_percent=0.

The parameter of zfs_statvfs() is changed to take an inode instead
of a dentry, since no dentry is available in zfs_fallocate_common().

A few tests from @behlendorf cover basic fallocate functionality.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Issue #326
Closes #10408
2020-06-18 11:22:11 -07:00
Grischa Zengel
059f7c20e3
man.8: Add bookmark to list of types
While checking bash_completion I missed bookmark as type.

```
# zfs get type zpool2#b
NAME      PROPERTY  VALUE     SOURCE
zpool2#b  type      bookmark  -
```

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Grischa Zengel <github.zfsonlinux@zengel.info>
Closes #10419
2020-06-10 17:53:07 -07:00
Matthew Ahrens
f66434268c
Remove unnecessary references to slavery
The horrible effects of human slavery continue to impact society.  The
casual use of the term "slave" in computer software is an unnecessary
reference to a painful human experience.

This commit removes all possible references to the term "slave".

Implementation notes:

The zpool.d/slaves script is renamed to dm-deps, which uses the same
terminology as `dmsetup deps`.

References to the `/sys/class/block/$dev/slaves` directory remain.  This
directory name is determined by the Linux kernel.  Although
`dmsetup deps` provides the same information, it unfortunately requires
elevated privileges, whereas the `/sys/...` directory is world-readable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10435
2020-06-10 17:07:59 -07:00
Andrea Gelmini
dd4bc569b9
Fix typos
Correct various typos in the comments and tests.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #10423
2020-06-09 21:24:09 -07:00
George Amanakis
b7654bd794
Trim L2ARC
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789 
Closes #10224
2020-06-09 10:15:08 -07:00
felixdoerre
501a1511ae
mount: use the mount syscall directly
Allow zfs datasets to be mounted on Linux without relying on the
invocation of an external processes.  This is the same behavior
which is implemented for FreeBSD.

Use of the libmount library was originally considered because it 
provides functionality to properly lock and update the /etc/mtab 
file.  However, these days /etc/mtab is typically a symlink to 
/proc/self/mounts so there's nothing to updated.  Therefore, we
call mount(2) directly and avoid any additional dependencies. 

If required the legacy behavior can be enabled by setting the 
ZFS_MOUNT_HELPER environment variable.  This may be needed in
environments where SELinux in enabled and the zfs binary does  
not have mount permission.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Felix Dörre <felix@dogcraft.de>
#10294
2020-05-20 18:02:41 -07:00
Paul Dagnelie
de4f06c275
Small program that converts a dataset id and an object id to a path
Small program that converts a dataset id and an object id to a path

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10204
2020-05-20 10:05:33 -07:00
AJ Jordan
5a04b17717 Fix up arcstat(1) to match our version
Turns out the illumos manpage, which is what this originates from, was
written for the original Perl version of the utility which is not the
version in the OpenZFS tree. *That* version originates from a Python
rewrite that was done for FreeNAS. So fix up the manpage to match what
we actually ship (and fix a few typos in the process).

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: AJ Jordan <alex@strugee.net>
Closes #10288
2020-05-11 16:22:48 -07:00
AJ Jordan
ac806a2559 Import the arcstat(1m) manpage from illumos
And move it from section 1m to section 1 for consistency.

Imported from illumos commit f34d737f.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: AJ Jordan <alex@strugee.net>
Closes #10288
2020-05-11 16:22:18 -07:00
George Amanakis
657fd33bcf
Improvements on persistent L2ARC
Functional changes:

We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.

1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
   reflect the total aligned size of log blocks on the device. This is
   also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
   which reflect the total number of L2ARC log blocks present on cache
   devices.  It is also reflected in the header of the cache device as
   "dh_lb_count".

In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.

Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.

If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().

If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.

In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.

Non-functional changes:

Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.

Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.

Fix inaccurate comments for example in l2arc_log_blk_restore().

Add asserts at the end in l2arc_evict() and l2arc_write_buffers().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10228
2020-05-07 16:34:03 -07:00
Paul B. Henson
7bf3e1fa0f OpenZFS 3254 - add support in zfs for aclmode=restricted
Authored-by: Paul B. Henson <henson@acm.org>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/3254
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/71dbfc287c
Closes #10266
2020-04-30 11:23:59 -07:00
Paul B. Henson
a1af567bb6 OpenZFS 742 - Resurrect the ZFS "aclmode" property OpenZFS 664 - Umask masking "deny" ACL entries OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics
Porting notes:
* Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter
  rather than 'zfsvfs_t *zfsvfs'
* zfs man pages changes mixed between zfs and new zfsprops man pages

Reviewed by: Aram Hvrneanu <aram@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Robert Gordon <rbg@openrbg.com>
Reviewed by: Mark.Maybee@oracle.com
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@nexenta.com>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/742
OpenZFS-issue: https://www.illumos.org/issues/664
OpenZFS-issue: https://www.illumos.org/issues/279
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110
Closes #10266
2020-04-30 11:22:45 -07:00
alex
47c9299fcc
zfs_create: round up volume size to multiple of bs
Round up the volume size requested in `zfs create -V size` to the next
higher multiple of the volblocksize. Updates the man page and adds a
test to verify the new behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: puffi <puffi@users.noreply.github.com>
Signed-off-by: Alex John <alex@stty.io>
Closes #8541 
Closes #10196
2020-04-24 19:04:34 -07:00
Matthew Ahrens
196bee4cfd
Remove deduplicated send/receive code
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
2020-04-23 10:06:57 -07:00
Matthew Ahrens
c618f87cd2
Add zstream redup command to convert deduplicated send streams
Deduplicated send and receive is deprecated.  To ease migration to the
new dedup-send-less world, the commit adds a `zstream redup` utility to
convert deduplicated send streams to normal streams, so that they can
continue to be received indefinitely.

The new `zstream` command also replaces the functionality of
`zstreamdump`, by way of the `zstream dump` subcommand.  The
`zstreamdump` command is replaced by a shell script which invokes
`zstream dump`.

The way that `zstream redup` works under the hood is that as we read the
send stream, we build up a hash table which maps from `<GUID, object,
offset> -> <file_offset>`.

Whenever we see a WRITE record, we add a new entry to the hash table,
which indicates where in the stream file to find the WRITE record for
this block. (The key is `drr_toguid, drr_object, drr_offset`.)

For entries other than WRITE_BYREF, we pass them through unchanged
(except for the running checksum, which is recalculated).

For WRITE_BYREF records, we change them to WRITE records.  We find the
referenced WRITE record by looking in the hash table (for the record
with key `drr_refguid, drr_refobject, drr_refoffset`), and then reading
the record header and payload from the specified offset in the stream
file.  This is why the stream can not be a pipe.  The found WRITE record
replaces the WRITE_BYREF record, with its `drr_toguid`, `drr_object`,
and `drr_offset` fields changed to be the same as the WRITE_BYREF's
(i.e. we are writing the same logical block, but with the data supplied
by the previous WRITE record).

This algorithm requires memory proportional to the number of WRITE
records (same as `zfs send -D`), but the size per WRITE record is
relatively low (40 bytes, vs. 72 for `zfs send -D`).  A 1TB send stream
with 8KB blocks (`recordsize=8k`) would use around 5GB of RAM to
"redup".

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10124 
Closes #10156
2020-04-10 10:39:55 -07:00
George Amanakis
77f6826b83
Persistent L2ARC
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #925 
Closes #1823 
Closes #2672 
Closes #3744 
Closes #9582
2020-04-10 10:33:35 -07:00
Paul Dagnelie
5a42ef04fd
Add 'zfs wait' command
Add a mechanism to wait for delete queue to drain.

When doing redacted send/recv, many workflows involve deleting files 
that contain sensitive data. Because of the way zfs handles file 
deletions, snapshots taken quickly after a rm operation can sometimes 
still contain the file in question, especially if the file is very 
large. This can result in issues for redacted send/recv users who 
expect the deleted files to be redacted in the send streams, and not 
appear in their clones.

This change duplicates much of the zpool wait related logic into a 
zfs wait command, which can be used to wait until the internal
deleteq has been drained.  Additional wait activities may be added 
in the future. 

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9707
2020-04-01 10:02:06 -07:00
Ryan Moeller
9a51738b60
Let default arc_c_max be platform dependent
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator.  Other platforms
do not require the same logic.

Implement an arc_default_max() function to determine a default max ARC
size in platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10155
2020-03-27 09:14:46 -07:00
Matthew Ahrens
652bdc9b0e
Deprecate deduplicated send streams
Dedup send can only deduplicate over the set of blocks in the send
command being invoked, and it does not take advantage of the dedup table
to do so. This is a very common misconception among not only users, but
developers, and makes the feature seem more useful than it is. As a
result, many users are using the feature but not getting any benefit
from it.

Dedup send requires a nontrivial expenditure of memory and CPU to
operate, especially if the dataset(s) being sent is (are) not already
using a dedup-strength checksum.

Dedup send adds developer burden. It expands the test matrix when
developing new features, causing bugs in released code, and delaying
development efforts by forcing more testing to be done.

As a result, we are deprecating the use of `zfs send -D` and receiving
of such streams.  This change adds a warning to the man page, and also
prints the warning whenever dedup send or receive are used.

In a future release, we plan to:
1. remove the kernel code for generating deduplicated streams
2. make `zfs send -D` generate regular, non-deduplicated streams
3. remove the kernel code for receiving deduplicated streams
4. make `zfs receive` of deduplicated streams process them in userland
   to "re-duplicate" them, so that they can still be received.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7887 
Closes #10117
2020-03-18 13:31:10 -07:00
Mariusz Zaborski
a57d3d45d6
Add option for forcible unmounting dataset while receiving snapshot.
Currently when the dataset is in use we can't receive snapshots.

    zfs send test/1@asd | zfs recv -FM test/2
    cannot unmount '/test/2': Device busy

This commits add option 'M' which attempts to forcibly unmount the
dataset.  Thanks to this we can enforce receiving snapshots in a
single step.

Note that this functionality is not supported on Linux because the
VFS will prevent active mounted filesystems from being unmounted,
even with the force option.  This is the intended VFS behavior.

Test cases were added to verify the expected behavior based on
the platform.

Discussed-with: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
External-issue: https://reviews.freebsd.org/D22306
Closes #9904
2020-03-17 10:08:32 -07:00
Matthew Ahrens
7261fc2e81
Improve zfs receive performance by batching writes
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object.  If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant.  For example, in some
workloads the `receieve_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all.  By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10099
2020-03-16 11:51:56 -07:00
Ryan Moeller
f5f6fb03b7
Change default to overlay=on
Filesystems allow overlay mounts by default on FreeBSD and Linux.

Respect the native convention by switching the default to overlay=on,
while retaining the option to turn the property off for compatibility
with other operating systems' conventions.

Update documentation and tests accordingly.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10030
2020-03-06 09:28:19 -08:00
Brian Behlendorf
2288d41968
Add trim support to zpool wait
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #10071
2020-03-04 15:07:11 -08:00
Mariusz Zaborski
89adffb7bc
Add notice that forcefully unmount is not supported on Linux
The Linux VFS will never allow a filesystem which is in use to
be unmounted.  This behavior differs from other platforms like
FreeBSD which allow a filesystem to be force unmounted.  This 
will result in errors being returned to applications actively
using the filesystem.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org>
Closes #10013
2020-02-18 13:36:23 -08:00
Richard Laager
f244846462
Prefer org.openzfs for features and properties
Moving forward, we wish to use org.openzfs (no dash) rather than
org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
The existing feature GUIDs cannot be changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #10003
2020-02-18 09:36:50 -08:00
InsanePrawn
ecbbdac799 Systemd mount generator: Generate noauto units; add control properties
This commit refactors the systemd mount generators and makes the
following major changes:

- The generator now generates units for datasets marked canmount=noauto,
  too. These units are NOT WantedBy local-fs.target.
  If there are multiple noauto datasets for a path, no noauto unit will
  be created. Datasets with canmount=on are prioritized.

- Introduces handling of new user properties which are now included in
  the zfs-list.cache files:
    - org.openzfs.systemd:requires:
      List of units to require for this mount unit
    - org.openzfs.systemd:requires-mounts-for:
      List of mounts to require by this mount unit
    - org.openzfs.systemd:before:
      List of units to order after this mount unit
    - org.openzfs.systemd:after:
      List of units to order before this mount unit
    - org.openzfs.systemd:wanted-by:
      List of units to add a Wants dependency on this mount unit to
    - org.openzfs.systemd:required-by:
      List of units to add a Requires dependency on this mount unit to
    - org.openzfs.systemd:nofail:
      Toggles between a wants and a requires dependency.
    - org.openzfs.systemd:ignore:
      Do not generate a mount unit for this dataset.

  Consult the updated man page for detailed documentation.

- Restructures and extends the zfs-mount-generator(8) man page with the
  above properties, information on unit ordering and a license header.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: InsanePrawn <insane.prawny@gmail.com>
Closes #9649
2020-02-14 15:32:55 -08:00
Jason King
13b5a4d5c0
Support setting user properties in a channel program
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9950
2020-02-14 13:41:42 -08:00
Matthew Ahrens
4fe3a842bb
Remove limit on number of async zio_frees of non-dedup blocks
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG.  This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.

zfs_async_block_max_blocks is set to 100,000 by default.  For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy").  For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.

This change increases zfs_async_block_max_blocks to be unlimited by
default.  To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg.  The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10000
2020-02-14 08:39:46 -08:00
Christian Schwarz
948f0c4419 zcp: add zfs.sync.bookmark
Add support for bookmark creation and cloning.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:17 -08:00
Christian Schwarz
a73f361fdb Implement bookmark copying
This feature allows copying existing bookmarks using

    zfs bookmark fs#target fs#newbookmark

There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.

Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).

ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.

Overview:

- Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
  - create new bookmark node
  - copy source's `zbn_phys` to new's `zbn_phys`
  - zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
  bookmarks as sources
  - => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
  and bookmark sources
- Adjust CLI
  - refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
  - warn about redaction bookmark handling
- Add test cases
  - CLI
  - pyyzfs libzfs_core bindings

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:12 -08:00
Paul Zuchowski
bc67cba7c0
Fix zdb -R with 'b' flag
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display.  Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9640
Closes #9729
2020-02-10 14:00:05 -08:00
Attila Fülöp
31b160f0a6
ICP: Improve AES-GCM performance
Currently SIMD accelerated AES-GCM performance is limited by two
factors:

a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.

To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.

Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9749
2020-02-10 12:59:50 -08:00
Ryan Moeller
8c4987c489
Restore aclmode and remove acltype on FreeBSD
This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE,
matching what is done in the NFSv4 ACLs PR (#9709).

On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide
ZFS_PROP_ACLMODE.

The tests already assume this arrangement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9913
2020-02-04 08:40:07 -08:00
Ned Bass
a3403164d7 zdb: add support for object ranges for zdb -d
Allow a range of object identifiers to dump with -d. This may
be useful when dumping a large dataset and you want to break
it up into multiple phases, or to resume where a previous scan
left off. Object type selection flags are supported to reduce
the performance overhead of verbosely dumping unwanted objects,
and to reduce the amount of post-processing work needed to
filter out unwanted objects from zdb output.

This change extends existing syntax in a backward-compatible
way. That is, the base case of a range is to specify a single
object identifier to dump. Ranges and object identifiers can
be intermixed as command line parameters.

Usage synopsis:

    Object ranges take the form <start>:<end>[:<flags>]
        start    Starting object number
        end      Ending object number, or -1 for no upper bound
        flags    Optional flags to select object types:
         A    All objects (this is the default)
         d    ZFS directories
         f    ZFS files
         m    SPA space maps
         z    ZAPs
         -    Negate effect of next flag

Examples:

 # Dump all file objects
 zdb -dd tank/fish 0👎f

 # Dump all file and directory objects
 zdb -dd tank/fish 0👎fd

 # Dump all types except file and directory objects
 zdb -dd tank/fish 0👎A-f-d

 # Dump object IDs in a specific range
 zdb -dd tank/fish 1000:2000

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #9832
2020-01-24 11:00:46 -08:00
Romain Dolbeau
35b07497c6 Add AltiVec RAID-Z
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.

Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9539
2020-01-23 11:01:24 -08:00
Jason King
e2ef1cbf04 Support inheriting properties in channel programs
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9738
2020-01-22 17:03:17 -08:00
Paul Zuchowski
f12e42cccf zdb -d should accept the numeric objset id
As an alternative to the dataset name, zdb now allows the decimal 
or hexadecimal objset ID to be specified.  When permanent errors
are reported as 2 hexadecimal numbers (objset ID : object ID) in 
zpool status; you can now use 'zdb <pool>[/objset ID] object' to
determine the names of the objset and object which have the error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9733
2020-01-16 09:22:49 -08:00
Richard Laager
f744f36ce5 Document zfs change-key caveats
As discussed on the 2019-01-07 OpenZFS Leadership Meeting, we need to be
clear about the limitations of `zfs change-key`.  Changing the user key
does not change the master key, nor does it currently overwrite the old
wrapped master key on disk.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #9819
2020-01-14 10:11:07 -08:00
Brian Behlendorf
e458fcca75
Change http://zfsonlinux.org links to https://zfsonlinux.org
Update the project website links contained in to repository to
reference the secure https://zfsonlinux.org address.

Reviewed-By: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9837
2020-01-13 16:43:59 -08:00
Tom Caputi
ba0ba69e50 Add 'zfs send --saved' flag
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9007
2020-01-10 10:16:58 -08:00
DeHackEd
67709516db zfs-module-parameters(5): add all missing items
I ran a report against the output of `modinfo zfs.ko`. This commit adds
everything missing and corrects a few renamed module parameters.
Specifically:

* zfs_checksums_per second renamed in ad796b8a3
* vdev_ms_count_limit renamed in c853f382d

Also fixes some variable type inconsistencies (unsigned int => uint)

Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #9809
2020-01-07 12:41:53 -08:00
Brian Behlendorf
33dc49e0c0
Fix ZPOOL_VDEV_NAME_PATH option description
The corresponding zpool status option is -P and not -p.  Update
this description to reference the correct option.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9803
2020-01-06 10:43:32 -08:00
Tony Hutter
9fb2771aa5 Colorize zpool status output
If the ZFS_COLOR env variable is set, then use ANSI color
output in zpool status:

- Column headers are bold
- Degraded or offline pools/vdevs are yellow
- Non-zero error counters and faulted vdevs/pools are red
- The 'status:' and 'action:' sections are yellow if they're
  displaying a warning.

This also includes a new 'faketty' function in libtest.shlib that is
compatible with FreeBSD (code provided by @freqlabs).

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9340
2019-12-19 16:26:07 -08:00
Matthew Macy
4bc721965f Add FreeBSD jail support hooks
Add the 'zfs jail/unjail' subcommands along with the relevant 
documentation from FreeBSD.  This feature is not supported on
Linux and still requires the match kernel ioctls which will
be included when the FreeBSD platform code is integrated.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9686
2019-12-11 11:58:37 -08:00
Paul Zuchowski
f0bf435176 zio_decompress_data always ASSERTs successful decompression
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail.  Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9612 
Closes #9630
2019-12-10 15:51:58 -08:00
Matthew Macy
f95704ca5e Disable EDONR on FreeBSD
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9664
2019-12-05 13:10:29 -08:00
Brian Behlendorf
624222ae31
Increase allowed 'special_small_blocks' maximum value
There may be circumstances where it's desirable that all blocks
in a specified dataset be stored on the special device.  Relax
the artificial 128K limit and allow the special_small_blocks
property to be set up to 1M.  When blocks >1MB have been enabled
via the zfs_max_recordsize module option, this limit is increased
accordingly.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9131
Closes #9355
2019-12-03 09:58:03 -08:00
Brian Behlendorf
9e17e6f254
Remove zfs_vdev_elevator module option
As described in commit f81d5ef6 the zfs_vdev_elevator module
option is being removed.  Users who require this functionality
should update their systems to set the disk scheduler using a
udev rule.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9417
Closes #9609
2019-11-27 10:35:49 -08:00
Paul Zuchowski
894f6696b4 Add display of checksums to zdb -R
The function zdb_read_block (zdb -R) was always intended to have a :c 
flag which would read the DVA and length supplied by the user, and 
display the checksum. Since we don't know which checksum goes with 
the data, we should calculate and display them all.

For each checksum in the table, read in the data at the supplied 
DVA:length, calculate the checksum, and display it. Update the man 
page and create a zfs test for the new feature.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9607
2019-11-27 10:08:18 -08:00
Kjeld Schouten
64c77c4bad Change zed.service to zfs-zed.service in man page
zed.service does not exist
replaced with correct service name in man.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Closes #9581
2019-11-13 10:23:23 -08:00
Ross Williams
c5ebfbbe19 Reorganize zpool(8) man page into sections
Moved subcommand topics into individual manpages. Reordered and 
grouped the list of subcommands by topic.

Moved concepts overview to `zpoolconcepts.8` and the long list of
available pool properties to `zpoolprops.8`.

Internal cross-references copied from `zpool.8` needed to be 
converted to `.Xr` external references to new subcommand manual 
pages.

Move `autotrim` into lexical order, autotrim tacked onto the end
of a list. Now it is in alphabetical order.

Clarify attach/detach description. Description was too specific to
command syntax. Overview clarifies reason for attaching or detaching
a device.

Clarify replace description, don't refer to subcommand arguments.

Clarify split command description, say what split actually does and
why you'd want to do it.

Clarify description of upgrade, and simplify the zpool.8 wording of
the zpool-upgrade(8) description.

Clarify description of import, detail what zpool-import(8) actually 
does.

Add appropriate SEE ALSO sections. Divided zpool subcommand manual 
pages need their own SEE ALSO sections. Also modified fsck.zfs.8 
to point directly to zfs-scrub.8 and zed.8.in to include a direct
reference to zfs-events.8

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ross Williams <ross@ross-williams.net>
Closes #9564
2019-11-13 09:21:07 -08:00
Ross Williams
870fc32fc9 Reorganize zfs(8) man page into sections
Most subcommands got their own manpages (e.g. create). Some related
commands grouped into a single manpage and symlinks created (e.g. set, 
get, and inherit). I did this when topics were either too short to 
warrant their own file or so interrelated that a user would want to
refer between commands in the same file.

Corrected .Sx internal references to .Xr cross refs; lots of .Sx 
references from when text was all in zfs.8 needed to be changed to 
.Xr zfs-$SUBCOMMAND 8 cross references.

Divided subcommand list in zfs(8) into sections of related 
functionality. This required writing new descriptions for some 
commands.

Preserved ".Os Linux", `.Os` macro parsing behavior differs between 
mandoc from the "BSD" mandoc package (available on Ubuntu) and man 
from Ubuntu's man-db package, which calls groff to format the manpages.

Groff handles the `.Os` macro differently and wrongly, defaulting 
it to "BSD" in `/usr/share/groff/*/tmac/mdoc/doc-common`, instead of
getting the default from `uname`.

A future set of changes will introduce build-time preprocessing of
manpages for platform-specific documentation and can insert the
correct operating system name.

Added SEE ALSO sections, the newly-divided zfs-*.8 subcommand man
pages needed their own SEE ALSO sections pointing to related
subcommands and, in some cases, documentation from other packages
(e.g. zfs-share.8).

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ross Williams <ross@ross-williams.net>
Closes #9559
2019-11-12 11:17:40 -08:00
Romain Dolbeau
0b2a642351 Add AVX512BW variant of fletcher
It is much faster than AVX512F when byteswapping on Skylake-SP
and newer, as we can do the byteswap in a single vshufb instead
of many instructions.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #9517
2019-10-30 12:26:14 -07:00
Brian Behlendorf
d292307212
Modify sharenfs=on default behavior
While it may sometimes be convenient to export an NFS filesystem with
no_root_squash it should not be the default behavior.  Align the
default behavior with the Linux NFS server defaults.  To restore
the previous behavior use 'zfs set sharenfs="no_root_squash,..."'.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9397
Closes #9425
2019-10-13 19:13:26 -07:00
Richard Yao
803884217f Implement ZPOOL_IMPORT_UDEV_TIMEOUT_MS
Since 0.7.0, zpool import would unconditionally block on udev for 30
seconds. This introduced a regression in initramfs environments that
lack udev (particularly mdev based environments), yet use a zfs userland
tools intended for the system that had been built against udev. Gentoo's
genkernel is the main example, although custom user initramfs
environments would be similarly impacted unless special builds of the
ZFS userland utilities were done for them.  Such environments already
have their own mechanisms for blocking until device nodes are ready
(such as genkernel's scandelay parameter), so it is unnecessary for
zpool import to block on a non-existent udev until a timeout is reached
inside of them.

Rather than trying to intelligently determine whether udev is available
on the system to avoid unnecessarily blocking in such environments, it
seems best to just allow the environment to override the timeout.  I
propose that we add an environment variable called
ZPOOL_IMPORT_UDEV_TIMEOUT_MS. Setting it to 0 would restore the 0.6.x
behavior that was more desirable in mdev based initramfs environments.
This allows the system user land utilities to be reused when building
mdev-based initramfs archives.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #9436
2019-10-11 09:52:48 -07:00
Brian Behlendorf
f3dc4a85e9
Update zfs program command usage
Update the zfs(8) man page to clearly describe that arguments for
channel programs are to be listed after the -- sentinel which
terminates argument processing.  This behavior is supported by
getopt on Linux, FreeBSD, and Illumos according to each platforms
respective man pages.

    zfs program [-jn] [-t instruction-limit] [-m memory-limit]
        pool script [--] arg1 ...

Reviewed-by: Clint Armstrong <clint@clintarmstrong.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9056 
Closes #9428
2019-10-10 09:49:45 -07:00
Brian Behlendorf
f81d5ef686
Add warning for zfs_vdev_elevator option removal
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices.  At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel which prompted
us to add compatibly code to set the elevator via a usermodehelper.
While well intentioned this introduced a bug which could cause a
system hang, that issue was subsequently fixed by commit 2a0d4188.

In order to avoid future bugs in this area, and to simplify the code,
this functionality is being deprecated.  A console warning has been
added to notify any existing consumers and the documentation updated
accordingly.  This option will remain for the lifetime of the 0.8.x
series for compatibility but if planned to be phased out of master.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9317
2019-09-25 09:23:29 -07:00
John Gallagher
e60e158eff Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.

This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:

 - Scrubs or resilvers to complete
 - Devices to initialized
 - Devices to be replaced
 - Devices to be removed
 - Checkpoints to be discarded
 - Background freeing to complete

For example, a scrub that is in progress could be waited for by running

    zpool wait -t scrub <pool>

This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.

This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.

Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:

 - Added ZoL-style ioctl input declaration.
 - Reorganized error handling in zpool_initialize in libzfs to integrate
   better with changes made for TRIM support.
 - Fixed check for whether a checkpoint discard is in progress.
   Previously it also waited if the pool had a checkpoint, instead of
   just if a checkpoint was being discarded.
 - Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
 - Updated more existing tests to make use of new 'zpool wait'
   functionality, tests that don't exist in Delphix OS.
 - Used existing ZoL tunable zfs_scan_suspend_progress, together with
   zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
 - Added support for a non-integral interval argument to zpool wait.

Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-13 18:09:06 -07:00
Andrea Gelmini
ac3d4d0cf6 Fix typos in man/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9233
2019-08-30 09:41:35 -07:00
Paul Dagnelie
eef0f4d84e Keep more metaslabs loaded
With the other metaslab changes loaded onto a system, we can 
significantly reduce the memory usage of each loaded metaslab and 
unload them on demand if there is memory pressure. However, none 
of those changes actually result in us keeping more metaslabs loaded. 
If we don't keep more metaslabs loaded, we will still have to wait 
for demand-loading to finish when no loaded metaslab can satisfy our 
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while 
unloading is currently done on a txg-count basis. Txgs can take a 
widely varying range of times, from tenths of a second to several 
seconds. This can result in confusing, hard to predict behavior.

This change simply adds a time-based component to metaslab unloading. 
A metaslab will remain loaded for one minute and 8 txgs (by default) 
after it was last used, unless it is evicted due to memory pressure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes #9197
2019-08-29 10:20:36 -07:00
Paul Dagnelie
f09fda5071 Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
2019-08-16 09:08:21 -06:00
George Wilson
c8242a96ba spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity 
of the data and metadata. The amount it scans will depend on the 
import flags provided. On systems with small amounts of memory or 
when importing a pool from the crash kernel, it's possible for 
spa_load_verify to issue too many I/Os that it consumes all the memory 
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
2019-08-13 08:11:57 -06:00
Serapheim Dimitropoulos
3b9edd7b17 Introduce getting holds and listing bookmarks through ZCP
Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor-refactoring was also
applied to distinguish between user and system properties
in ZCP.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com>

OpenZFS-issue: https://illumos.org/issues/8862
Closes #7902
2019-08-12 10:02:34 -07:00
Paul Dagnelie
c81f1790e2 Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-05 14:34:27 -07:00
Serapheim Dimitropoulos
cae97c88c4 List log_spacemap feature in zpool-features.5 manual
Update zpool-features.5 manpage to describe the log_spacemap feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9096
2019-07-31 09:29:01 -07:00
Sara Hartse
37f03da8ba Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
2019-07-26 10:54:14 -07:00
Pavel Zakharov
26b6047469 New service that waits on zvol links to be created
The zfs-volume-wait.service scans existing zvols and waits for their
links under /dev to be created. Any service that depends on zvol
links to be there should add a dependency on zfs-volumes.target.
By default, this target is not enabled.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pzakharov@delphix.com>
Closes #8975
2019-07-17 15:33:05 -07:00
Mike Gerdts
d45d7f08fa Add zfs create dryrun
Adds the ability to sanity check zfs create arguments and to see the
value of any additional properties that will local to the dataset.  For
example, automation that may need to adjust quota on a parent filesystem
before creating a volume may call `zfs create -nP -V <size> <volume>` to
obtain the value of refreservation.  This adds the following options to
zfs create:

- -n dry-run (no-op)
- -v verbose
- -P parseable (implies verbose)

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Mike Gerdts <mike.gerdts@joyent.com>
Closes #8974
2019-07-16 11:19:24 -07:00
Serapheim Dimitropoulos
93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00