Commit Graph

853 Commits

Author SHA1 Message Date
Ryan Moeller
088f97b921 Canonicalize Python shebangs
/usr/bin/env python3 is the suggested[1] shebang for Python in general
(likewise for python2) and is conventional across platforms. This eases
development on systems where python is not installed in /usr/bin
(FreeBSD for example) and makes it possible to develop in virtual
environments (venv) for isolating dependencies.

Many packaging guidelines discourage the use of /usr/bin/env, but since
this is the canonical way of writing shebangs in the Python community,
many packaging scripts are already equipped to handle substituting the
appropriate absolute path to python automatically.

Some RPM package builders lacking brp-mangle-shebangs need a small
fallback mechanism in the package spec to stamp the appropriate shebang
on installed Python scripts.

[1]: https://docs.python.org/3/using/unix.html?#miscellaneous

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9314
2020-01-22 13:49:00 -08:00
Andrea Gelmini
eaf8e3b779 Fix typos in cmd/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9234
2020-01-22 13:48:58 -08:00
George Wilson
628fd31d26 spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue too many I/Os that it consumes all the memory
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
2020-01-22 13:48:57 -08:00
Mike Gerdts
f3f46b0e45 OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocks
When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Mike Gerdts <mike.gerdts@joyent.com>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0
Closes #8973
2020-01-22 13:48:56 -08:00
Matthew Ahrens
bee5738f77 make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations.  The large allocations are for ZIL blocks.  If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more.  In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
2020-01-22 13:48:56 -08:00
loli10K
146d7d8846 Fix zpool subcommands error message with some unsupported options
Both 'detach' and 'online' zpool subcommands, when provided with an
unsupported option, forget to print it in the error message:

   # zpool online -t rpool vda3
   invalid option ''
   usage:
      online [-e] <pool> <device> ...

This changes fixes the error message in order to include the actual
option that is not supported.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9270
2019-09-25 11:27:51 -07:00
Pavel Zakharov
5acba22ec0 zvol_wait script should ignore partially received zvols
Partially received zvols won't have links in /dev/zvol.

Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Closes #9260
2019-09-25 11:27:51 -07:00
Pavel Zakharov
38528476bf New service that waits on zvol links to be created
The zfs-volume-wait.service scans existing zvols and waits for their
links under /dev to be created. Any service that depends on zvol
links to be there should add a dependency on zfs-volumes.target.
By default, this target is not enabled.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pzakharov@delphix.com>
Closes #8975
2019-09-25 11:27:51 -07:00
Matthew Ahrens
1f5979d23f zed crashes when devid not present
zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.

zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.

Typically this is not a problem because usually either all vdevs have
devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devid's and some do
not, then the problem is easily reproduced.  This can happen if the pool
has been moved from a system that has devid's to one that does not.

The fix is for zfs_agent_iter_vdev() to only try to match the devid's if
both nvl and gsp have devid's present.

Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65090
Closes #9054
Closes #9060
2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos
1c4b0fc745 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015
Closes #9044
2019-09-25 11:27:50 -07:00
Antonio Russo
af7a5672c3 systemd encryption key support
Modify zfs-mount-generator to produce a dependency on new
zfs-import-key-*.service units, dynamically created at boot to call
zfs load-key for the encryption root, before attempting to mount any
encrypted datasets.

These units are created by zfs-mount-generator, and RequiresMountsFor on
the keyfile, if present, or call systemd-ask-password if a passphrase is
requested.

This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and
@rlaager, as well an adaptation of @rlaager's script to retry on
incorrect password entry.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #8750
Closes #8848
2019-09-25 11:27:49 -07:00
Brian Behlendorf
cc7fe8a599 Fix out-of-tree build failures
Resolve the incorrect use of srcdir and builddir references for
various files in the build system.  These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.

With this change it's again possible to build in a subdirectory.

    $ mkdir obj
    $ cd obj
    $ ../configure
    $ make

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8921
Closes #8943
2019-09-25 11:27:48 -07:00
Don Brady
95fcb04215 Let zfs mount all tolerate in-progress mounts
The zfs-mount service can unexpectedly fail to start when zfs
encounters a mount that is in progress. This service uses
zfs mount -a, which has a window between the time it checks if
the dataset was mounted and when the actual mount (via mount.zfs
binary) occurs.

The reason for the racing mounts is that both zfs-mount.target
and zfs-share.target are allowed to execute concurrently after
the import.  This is more of an issue with the relatively recent
addition of parallel mounting, and we should consider serializing
the mount and share targets.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8881
2019-09-25 11:27:48 -07:00
Allan Jude
d053481523 zstreamdump: add per-record-type counters and an overhead counter
Count the bytes of payload for each replication record type

Count the bytes of overhead (replication records themselves)

Include these counters in the output summary at the end of the run.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Sponsored-By: Klara Systems and Catalogic
Closes #8432
2019-09-25 11:27:48 -07:00
Michael Niewöhner
2087b6cf49 Fix memory leak in check_disk()
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #8897
Closes #8911
2019-09-25 11:27:47 -07:00
Josh Soref
6ce10fdabb grammar: it is / plural agreement
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Closes #8818
2019-09-25 11:27:46 -07:00
Ryan Moeller
90d8067a77 Update comments to match code
s/get_vdev_spec/make_root_vdev

The former doesn't exist anymore.

Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8759
2019-09-25 11:27:46 -07:00
Eli Schwartz
ba505f90d8 arc_summary: prefer python3 version and install when there is no python
This matches the behavior of other python scripts, such as arcstat and
dbufstat, which are always installed but whose install-exec-hook actions
will simply touch up the shebang if a python interpreter was configured
*and* that interpreter is a python2 interpreter.

Fixes installation in a minimal build chroot without python available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
Closes #8851
2019-06-11 10:45:02 -07:00
Ryan Moeller
d6920fb996 Make Python detection optional and more portable
Previously, --without-python would cause ./configure to fail. Now it is
able to proceed, and the Python scripts will not be built.

Use portable parameter expansion matching instead of nonstandard
substring matching to detect the Python version.  This test is
duplicated in several places, so define a function for it.

Don't assume the full path to binaries, since different platforms do
install things in different places.  Use AC_CHECK_PROGS instead.

When building without Python, also build without pyzfs.

Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Eli Schwartz <eschwartz93@gmail.com>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8809 
Closes #8731
2019-06-07 12:45:40 -07:00
loli10K
51de7ccb42 Endless loop in zpool_do_remove() on platforms with unsigned char
On systems where "char" is an unsigned type the value returned by
getopt() will never be negative (-1), leading to an endless loop:
this issue prevents both 'zpool remove' and 'zstreamdump' for
working on some systems.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8789
2019-06-07 12:45:40 -07:00
Tom Caputi
b746c397e3 Disable parallel processing for 'zfs mount -l'
Currently, 'zfs mount -a' will always attempt to parallelize
work related to mounting as best it can. Unfortunately, when
the user passes the '-l' option to load keys, this causes
all threads to prompt the user for their keys at once,
resulting in a confusing and racy user experience. This patch
simply disables parallel mounting when using the '-l' flag.

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8762 
Closes #8811
2019-06-07 12:39:13 -07:00
loli10K
df717bb835 zpool: status -t is not documented in help message
This commit adds the undocumented "-t" option to zpool(8) help message.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8782
2019-06-07 12:39:13 -07:00
loli10K
cd75d5f710 zfs: missing newline character in zfs_do_channel_program() error message
This commit simply adds a missing newline ("\n") character to the error
message printed by the zfs command when the provided pool parameter
can't be found.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8783
2019-06-07 12:39:13 -07:00
loli10K
abe267f677 zpool: trim -p is not a valid option
This commit removes the documented but not handled "-p" option from
zpool(8) help message.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8781
2019-06-07 12:39:13 -07:00
Alexander Motin
4bb40c1c82 Fix dataset name comparison in zfs_compare()
The code never returned match comparing two datasets (not snapshots).
As result, uu_avl_find(), called from zfs_callback(), never succeeded,
allowing to add same dataset into the list multiple times, for example:

	# zfs get name pers pers pers@z pers@z
	NAME    PROPERTY  VALUE   SOURCE
	pers    name      pers    -
	pers    name      pers    -
	pers@z  name      pers@z  -

With the patch:

	# zfs get name pers pers pers@z pers@z
	NAME    PROPERTY  VALUE   SOURCE
	pers    name      pers    -
	pers@z  name      pers@z  -

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #8723
2019-05-08 16:42:39 -07:00
JMoVS
e2dddb7e58 Fix typesetting of Errata #4
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Justin Scholz <git@justinscholz.de>
Closes #8712 
Closes #8721
2019-05-08 16:04:45 -07:00
JMoVS
b3b60984ee Clearer wording on Errata #4
Users of existing pools, especially pools with top-level encrypted 
datasets, could run into trouble trying to work around Errata #4. 
Clarify that removing encrypted snapshots and bookmarks is enough
to clear the errata.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Justin Scholz <git@justinscholz.de>
Closes #8682 
Closes #8683
2019-05-02 16:52:57 -07:00
Tom Caputi
85bdc68401 Fix estimated scrub completion time
Currently, it is possible for the 'zpool scrub' command to
progress slightly beyond 100% due to concurrent changes
happening on the live pool. This behavior is expected, but
the userspace code for 'zpool status' would subtract the
expected amount of data from the amount of data already
scrubbed, resulting in a negative integer being casted to a
large positive one. This number was then used to calculate
the estimated completion time, resulting in wildly wrong
results. This code changes the behavior so that 'zpool status'
does not attempt to report an estimate during this period.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8611 
Closes #8687
2019-05-01 17:34:24 -07:00
Tomohiro Kusumi
2a15c00f89 Use sigaction(2) instead of sigignore(3) for portability
sigignore(3) isn't portable.
This code fails to compile on platforms without sigignore(3).
Use sigaction(2).

--
zfs_main.c: In function 'zfs_do_diff':
zfs_main.c:7178:9: error: implicit declaration of function 'sigignore' [-Werror=implicit-function-declaration]
  (void) sigignore(SIGPIPE);
         ^~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8593
2019-04-30 20:46:15 -07:00
TerraTech
50478c6dad Add option [-V|--version] to emit version string
Add the 'zfs version' and 'zpool version' subcommands to display
the version of the user space utilities and loaded zfs kernel
module.  For example:

$ zfs version
zfs-0.8.0-rc3_169_g67e0366b88
zfs-kmod-0.8.0-rc3_169_g67e0366b88

The '-V' and '--version' aliases were added to support the
common convention of using 'zfs --version` to obtain the version
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com>
Closes #2501
Closes #8567
2019-04-16 12:24:06 -07:00
Tom Caputi
7dcd318832 Cleanup nits from ab7615d92
This patch simply up cleans up a nit and corrects an error message
issue that were introduced in the Multiple DVA scrub patch.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8619
2019-04-14 11:03:06 -07:00
Brian Behlendorf
c375c69eca
Fix 'zfs list -t snapshot' depth
Commit df583073 introduced the ability to list the snapshots for a
specified dataset.  This change inadvertently resulted in only the top-
level snapshots being listed when no dataset was specified.  Fix this
issue by adding an additional check to determine if a dataset was
provided to avoid incorrectly restricting the depth.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8591 
Closes #8594
2019-04-08 09:14:45 -07:00
Sara Hartse
a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Tom Caputi
df583073eb Do not iterate through filesystems unnecessarily
Currently, when attempting to list snapshots ZFS may do a lot of
extra work checking child datasets. This is because the code does
not realize that it will not be able to reach any snapshots
contained within snapshots that are at the depth limit since the
snapshots of those datasets are counted as an additional layer
deeper. This patch corrects this issue.

In addition, this patch adds the ability to do perform the commands:

$ zfs list -t snapshot <dataset>
$ zfs get -t snapshot <prop> <dataset>

as a convenient way to list out properties of all snapshots of a
given dataset without having to use the depth limit.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8539
2019-04-01 11:58:59 -07:00
Brian Behlendorf
1b939560be
Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00
Olaf Faaland
060f0226e6 MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 12:47:57 -07:00
Brian Behlendorf
066da71e7f
Improve zpool labelclear
1) As implemented the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found.  If the device as been re-purposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed.  Prevent this by verifying
every label removed is intact before it's overwritten.

2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs.  Only block devices support
BLKFLSBUF, try the ioctl() but when it's reported as unsupported
this should not be fatal.

3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`.  Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.

4) Remove LABELCLEAR and LABELREAD variables for test cases.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8500 
Closes #8373 
Closes #6261
2019-03-21 10:13:01 -07:00
Tom Caputi
ab7615d92c Multiple DVA Scrubbing Fix
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.

This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.

To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8453
2019-03-15 14:14:31 -07:00
Tom Caputi
eaed840542 Better user experience for errata 4
This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.

* The errata warning has been made less scary for users without
  any encrypted datasets.

* The errata warning now clears itself without a pool reimport if
  the bookmark_v2 feature is enabled and no encrypted datasets
  exist.

* It is no longer possible to create new encrypted datasets without
  enabling the bookmark_v2 feature, thus helping to ensure that the
  errata is resolved.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue ##8308
Closes #8504
2019-03-14 16:48:30 -07:00
kpande
98310e5d1a Update commented zed.rc values to defaults
Update zed.rc values reflect their default value.  This helps
avoid confusion if a user expects functionality to be enabled.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #8498
2019-03-14 09:53:34 -07:00
Tom Caputi
5cc9ba5cf0 Make zstreamdump -v more greppable
Currently, the verbose output of zstreamdump includes new line
characters within some individual records. Presumably, this was
originally done to keep the output from getting too wide to fit
on a terminal. However, since new flags and struct members have
been added, these rules have not been maintained consistently. In
addition, these newlines can make it hard to grep the output in
some scenarios. This patch simply removes these newlines, making
the output easier to grep and removing the inconsistency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8493
2019-03-13 11:19:23 -07:00
Tom Caputi
f00ab3f22c Detect and prevent mixed raw and non-raw sends
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.

This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8308
2019-03-13 11:00:43 -07:00
jwittlincohen
146bdc414c Fix typo in arc_summary3
This is a simple fix for a typo ("perfetch" rather than "prefetch")
in arc_summary3.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Jason Cohen <jwittlincohen@gmail.com>
Closes #8499
2019-03-13 10:43:55 -07:00
Alek P
4c0883fb4a Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
2019-03-12 13:13:22 -07:00
Olaf Faaland
8133679ff0 Do not resume a pool if multihost is enabled
When multihost is enabled, and a pool is suspended, return
EINVAL in response to "zpool clear <pool>".  The pool
may have been imported on another host while I/O was suspended.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6933 
Closes #8460
2019-02-28 17:56:19 -08:00
Allan Jude
d6838ae649 zstreamdump: include embedded writes when dumping raw data (-d)
When feeding a replication stream to `zstreamdump -d` (raw dump mode),
it does not print the raw data for DRR_WRITE_EMBEDDED records.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #8430
2019-02-27 17:55:25 -08:00
Matthew Ahrens
c568ab8d99 zfs.8 has wrong description of "zfs program -t"
The "-t" argument to "zfs program" specifies a limit on the number of
LUA instructions that can be executed.  The zfs.8 manpage has the wrong
description.  It should be updated to match what's in zfs-program.8

Also fix the formatting of the zfs help message.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8410
2019-02-26 11:15:28 -08:00
Paul Zuchowski
9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Igor K
cf89a4ec9d zdb: replace label_t to zdb_label_t for reduce collisions
with builds on illumos based platform we can see build issue
because label_t has been redefined.

for reduce build issues on others platforms we should rename
label_t to zdb_label_t.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8397
2019-02-13 11:28:36 -08:00
Serapheim Dimitropoulos
425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00