Commit Graph

4537 Commits

Author SHA1 Message Date
Matthew Ahrens
67c0f0dedc
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.

Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 10:42:27 -07:00
Arvind Sankar
0b03254830 Fix tags targets in module/Makefile.in + cleanup
These targets look to have been copied from an automake-generated
Makefile.in, and can't work since none of the auto-generated automake
variables are defined here.

Moreover, ctags has been overridden in the top-level Makefile, so the
target is pointless anyway, and gtags is not a recursive target.

Fix cscopelist by moving it to the top-level Makefile as well, in line
with ctags and etags.

Also, add -a to ctags command as well, otherwise it won't work if more
than one xargs invocation takes place.

Add assembler files to ctags/etags, prune all dotted-dirs, and restrict
the find to files only.

Cleanup: add .PHONY to module/Makefile.in, and fix one recipe with a
missing continuation character.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10493
2020-06-24 18:19:44 -07:00
Arvind Sankar
109d2c9310 Move zfs_gitrev.h to build directory
Currently an out-of-tree build does not work with read-only source
directory because zfs_gitrev.h can't be created. Move this file to the
build directory, which is more appropriate for a generated file, and
drop the dist-hook for zfs_gitrev.h. There is no need to distribute this
file since it will be regenerated as part of the compilation in any
case.

scripts/make_gitrev.sh tries to avoid updating zfs_gitrev.h if there has
been no change, however this doesn't cover the case when the source
directory is not in git: in that case zfs_gitrev.h gets overwritten even
though it's always "unknown". Simplify the logic to always write out a
new version of zfs_gitrev.h, compare against the old and overwrite only
if different. This is now simple enough to just include in the
Makefile, so drop the script.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10493
2020-06-24 18:19:28 -07:00
Arvind Sankar
33982eb24c Support out-of-tree kmod build on FreeBSD
If srcdir != builddir, pass down MAKEOBJDIR to the FreeBSD make to
support out-of-tree builds.

Also allow passing all the gmake options that FreeBSD make understands
to support useful flags like -k, -n, -q etc, and detect the number of
CPUs if -j was specified without an argument.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10493
2020-06-24 18:18:41 -07:00
Ryan Moeller
9192f27c1d
Add zfs_multihost_interval tunable handler for FreeBSD
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.

Add the handler so the tunable can be declared in common code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10490
2020-06-23 13:32:42 -07:00
Matthew Ahrens
540493ba4f
Clarify comments in config/*.m4, vdev_geom.c, zfs_allow_*.ksh
Rephrase comments to be more clear.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10481
2020-06-22 09:46:37 -07:00
Ryan Moeller
1c08fa8b5b
Fix copy-paste error breaking FreeBSD head
Resolve the FreeBSD head build failure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10480
2020-06-19 15:12:34 -07:00
Ryan Moeller
2e6af52b2e
Match new vfs_checkexp KPI in FreeBSD head
KPI changed in FreeBSD, update accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10475
2020-06-18 13:45:36 -07:00
Arvind Sankar
ae7b167a98 Enable -Wmissing-prototypes/-Wstrict-prototypes
Switch on warning flags to detect mismatch between declaration and
definition.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:53 -07:00
Arvind Sankar
c0673571d0 Switch off -Wmissing-prototypes for libgcc math functions
spl-generic.c defines some of the libgcc integer library functions on
32-bit. Don't bother checking -Wmissing-prototypes since nothing should
directly call these functions from C code.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:46 -07:00
Arvind Sankar
eebba5d8f4 Make Skein_{Get,Put}64_LSB_First inline functions
Turn the generic versions into inline functions and avoid
SKEIN_PORT_CODE trickery.

Also drop the PLATFORM_MUST_ALIGN check for using the fast bcopy
variants. bcopy doesn't assume alignment, and the userspace version is
currently different because the _ALIGNMENT_REQUIRED macro is only
defined by the kernelspace headers.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:38 -07:00
Arvind Sankar
0ce2de637b Add prototypes
Add prototypes/move prototypes to header files.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:32 -07:00
Arvind Sankar
60356b1a21 Add include files for prototypes
Include the header with prototypes in the file that provides definitions
as well, to catch any mismatch between prototype and definition.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:25 -07:00
Arvind Sankar
c3fe42aabd Remove dead code
Delete unused functions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:18 -07:00
Arvind Sankar
65c7cc49bf Mark functions as static
Mark functions used only in the same translation unit as static. This
only includes functions that do not have a prototype in a header file
either.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:20:38 -07:00
adilger
f734301d22
linux: add basic fallocate(mode=0/2) compatibility
Implement semi-compatible functionality for mode=0 (preallocation)
and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL.

Since ZFS does COW and snapshots, preallocating blocks for a file
cannot guarantee that writes to the file will not run out of space.
Even if the first overwrite was guaranteed, it would not handle any
later overwrite of blocks due to COW, so strict compliance is futile.
Instead, make a best-effort check that at least enough free space is
currently available in the pool (with a bit of margin), then create
a sparse file of the requested size and continue on with life.

This does not handle all cases (e.g. several fallocate() calls before
writing into the files when the filesystem is nearly full), which
would require a more complex mechanism to be implemented, probably
based on a modified version of dmu_prealloc(), but is usable as-is.

A new module option zfs_fallocate_reserve_percent is used to control
the reserve margin for any single fallocate call.  By default, this
is 110% of the requested preallocation size, so an additional 10% of
available space is reserved for overhead to allow the application a
good chance of finishing the write when the fallocate() succeeds.
If the heuristics of this basic fallocate implementation are not
desirable, the old non-functional behavior of returning EOPNOTSUPP
for calls can be restored by setting zfs_fallocate_reserve_percent=0.

The parameter of zfs_statvfs() is changed to take an inode instead
of a dentry, since no dentry is available in zfs_fallocate_common().

A few tests from @behlendorf cover basic fallocate functionality.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Issue #326
Closes #10408
2020-06-18 11:22:11 -07:00
Matthew Macy
8056a75672
Disambiguate condvar API contract
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.

Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10471
2020-06-18 10:17:50 -07:00
Matthew Macy
7564073ed6
Add abd_cache_reap_now for abd_chunk_cache users
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10474
2020-06-17 21:44:13 -07:00
Jorgen Lundman
4458157bee
zfs_ioctl: saved_poolname can be truncated
As it uses kmem_strdup() and kmem_strfree() which both rely on
strlen() being the same, but saved_poolname can be truncated causing:

SPL: kernel memory allocator:
buffer freed to wrong cache
SPL: buffer was allocated from kmem_alloc_16,
SPL: caller attempting free to kmem_alloc_8.
SPL: buffer=0xffffff90acc66a38  bufctl=0x0  cache: kmem_alloc_8

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10469
2020-06-17 14:30:03 -07:00
Alexander Motin
17ca30185a
Set initial arc_c to arc_c_min instead of arc_c_max
For at least 15 years since OpenSolaris arc_c was set by default to
arc_c_max, later decreased under memory pressure.  I've noticed that
if arc_c was set high enough to cause memory pressure as considered
by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes
no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return.  All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point when
OS low memory handler activates and reduces arc_c, requesting fast
reclamation of just allocated memory.

The problem seems to be more serious on FreeBSD and I guess Linux,
since neither of them implement/use asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time.  On older FreeBSD 11 not
supporting multiple memory domains system with lots of RAM can get
completely unresponsive for minutes due to heavy lock congestion
between ARC reclamation and page daemon kmem reclamation threads.
With this change to more conservative arc_c value ARC stops growing
just it time and does not need later reclamation.

Also while there, since now growing arc_c is a more often situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock congestion.  It is also getting in sync with code in
arc_get_data_impl().

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437
2020-06-17 14:27:04 -07:00
Ryan Moeller
86a0f49483
FreeBSD: Kernel module should depend on xdr not krpc after 1300092
Since https://reviews.freebsd.org/D24408 FreeBSD provides XDR functions
in the xdr module instead of krpc.

For FreeBSD 13, the MODULE_DEPEND should be changed to xdr

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10442 
Closes #10443
2020-06-16 11:47:04 -07:00
Jorgen Lundman
d366c8fd7a
Make struct vdev_disk_t be platform private
Linux defines different vdev_disk_t members to macOS, but they are
only used in vdev_disk.c so move the declaration there.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10452
2020-06-16 11:43:33 -07:00
Brian Atkinson
0a03495e3e
Fixing ABD struct allocation for FreeBSD
In the event we are allocating a gang ABD in FreeBSD we are passing 0
to abd_alloc_struct(); however, this led to an allocation of ABD scatter
with 0 chunks. This left the gang ABD allocation 24 bytes smaller than
it should have been.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10431
2020-06-16 10:05:22 -07:00
Jorgen Lundman
883a40fff4
Add convenience wrappers for common uio usage
The macOS uio struct is opaque and the API must be used, this
makes the smallest changes to the code for all platforms.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10412
2020-06-14 10:09:55 -07:00
Jorgen Lundman
4f73576ea1
Upstream: zil_commit_waiter() can stall forever
On macOS clock_t is unsigned, so when cv_timedwait_hires() returns -1
we loop forever. The conditional was tweaked to ignore signedness.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10445
2020-06-14 10:08:21 -07:00
Brian Atkinson
e08b993396
Removing ZERO_PAGE abd_alloc_zero_scatter
For MIPS architectures on Linux the ZERO_PAGE macro references
empty_zero_page, which is exported as a GPL symbol. The call to
ZERO_PAGE in abd_alloc_zero_scatter has been removed and a single
zero'd page is now allocated for each of the pages in abd_zero_scatter
in the kernel ABD code path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10428
2020-06-10 17:54:11 -07:00
Ryan Moeller
feff3f69fc
Fixup "Avoid the GEOM topology lock recursion when autoexpanding a pool"
The patch was applied to vdev_geom_open instead of vdev_geom_close by
mistake.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10427
2020-06-10 11:05:15 -07:00
Arvind Sankar
71504277ae Cleanup linux module kbuild files
The linux module can be built either as an external module, or compiled
into the kernel, using copy-builtin. The source and build directories
are slightly different between the two cases, and currently, compiling
into the kernel still refers to some files from the configured ZFS
source tree, instead of the copies inside the kernel source tree. There
is also duplication between copy-builtin, which creates a Kbuild file to
build ZFS inside the kernel tree, and the top-level module/Makefile.in.

Fix this by moving the list of modules and the CFLAGS settings into a
new module/Kbuild.in, which will be used by the kernel kbuild
infrastructure, and using KBUILD_EXTMOD to distinguish the two cases
within the Makefiles, in order to choose appropriate include
directories etc.

Module CFLAGS setting is simplified by using subdir-ccflags-y (available
since 2.6.30) to set them in the top-level Kbuild instead of each
individual module. The disabling of -Wunused-but-set-variable is removed
from the lua and zfs modules. The variable that the Makefile uses is
actually not defined, so this has no effect; and the warning has long
been disabled by the kernel Makefile itself.

The target_cpu definition in module/{zfs,zcommon} is removed as it was
replaced by use of CONFIG_SPARC64 in
  commit 70835c5b75 ("Unify target_cpu handling")

os/linux/{spl,zfs} are removed from obj-m, as they are not modules in
themselves, but are included by the Makefile in the spl and zfs module
directories. The vestigial Makefiles in os and os/linux are removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10379
Closes #10421
2020-06-10 09:24:15 -07:00
Andrea Gelmini
dd4bc569b9
Fix typos
Correct various typos in the comments and tests.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #10423
2020-06-09 21:24:09 -07:00
Matthew Ahrens
7bcb7f0840
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:

By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks.  By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature.  A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out.  The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics
of zfs send/receive:

1. "-L to no-L" incrementals are rejected.  If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:

    incremental send stream requires -L (--large-block), to match
    previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals.  This flag is currently not set on any send streams.  In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.

In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks.  The zio pipeline will recompress
each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224 
Closes #10383
2020-06-09 10:41:01 -07:00
George Amanakis
b7654bd794
Trim L2ARC
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789 
Closes #10224
2020-06-09 10:15:08 -07:00
Michael Niewöhner
32f26eaa70
Move GFP flags kernel compatibility code
Move the GFP flags kernel compat code from c file to kmem header.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #10424
2020-06-08 16:33:46 -07:00
Michael Niewöhner
080102a1b6
Linux 5.8 compat: __vmalloc()
The `pgprot` argument has been removed from `__vmalloc` in Linux 5.8,
being `PAGE_KERNEL` always now [1].

Detect this during configure and define a wrapper for older kernels.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/mm/vmalloc.c?h=next-20200605&id=88dca4ca5a93d2c09e5bbc6a62fbfc3af83c4fca

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #10422
2020-06-08 16:32:02 -07:00
Pawel Jakub Dawidek
529246df96
Restore support for in-kernel ZFS ioctls
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.

Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #10417
2020-06-08 13:57:22 -07:00
Pawel Jakub Dawidek
77b998fa70
Remove redundant includes
By removing excessive includes it takes us a small step close to
compiling this file in userland.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #10415
2020-06-08 09:57:36 -07:00
Jorgen Lundman
c9e319faae
Replace sprintf()->snprintf() and strcpy()->strlcpy()
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used.  If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().

The biggest change is adding a size parameter to zfs_id_to_fuidstr().

The various *_impl_get() functions are only used on linux and have
not yet been updated.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10400
2020-06-07 11:42:12 -07:00
Ryan Moeller
60265072e0
Improve compatibility with C++ consumers
C++ is a little picky about not using keywords for names, or string
constness.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10409
2020-06-06 12:54:04 -07:00
Brian Behlendorf
63761a8f1a
zfsvfs_setup(): zap_stats_t may have undefined content when accessed (#10398)
Signed-off-by: Allan Jude <allanjude@klarasystems.com>

Co-authored-by: Allan Jude <allanjude@klarasystems.com>
2020-06-05 17:21:04 -07:00
Allan Jude
4547fc4e07
Connect dataset_kstats for FreeBSD
Expand the FreeBSD spl for kstats to support all current types

Move the dataset_kstats_t back to zvol_state_t from zfs_state_os_t
now that it is common once again

```
kstat.zfs/mypool.dataset.objset-0x10b.nunlinked: 0
kstat.zfs/mypool.dataset.objset-0x10b.nunlinks: 0
kstat.zfs/mypool.dataset.objset-0x10b.nread: 150528
kstat.zfs/mypool.dataset.objset-0x10b.reads: 48
kstat.zfs/mypool.dataset.objset-0x10b.nwritten: 134217728
kstat.zfs/mypool.dataset.objset-0x10b.writes: 1024
kstat.zfs/mypool.dataset.objset-0x10b.dataset_name: mypool/datasetname
```

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10386
2020-06-05 17:17:02 -07:00
Paul Dagnelie
99b281f1ae
Fix double mutex_init bug in send code
It was possible to cause a kernel panic in the send code by 
initializing an already-initialized mutex, if a record was created 
with type DATA, destroyed with a different type (bypassing the 
mutex_destroy call) and then re-allocated as a DATA record again.

We tweak the logic to not change the type of a record once it has 
been created, avoiding the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10374
2020-06-03 19:53:21 -07:00
Allan Jude
7da304bbce zfsvfs_setup(): zap_stats_t may have undefined content when accessed
Signed-off-by: Allan Jude <allanjude@klarasystems.com>
2020-06-03 22:20:27 +00:00
Ryan Moeller
c0eb5c35ef
FreeBSD: Simplify zvol and fix locking
zvol_geom_bio_strategy should handle its own use of the zvol
suspend reader lock and ensure the zilog exists when needed.

A few other places using the zvol zilog should use the suspend
reader lock as well.

Simplify consumers of zvol_geom_bio_strategy, fix the locking, and
while in here, use the boolean_t constants with doread.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10381
2020-06-03 10:45:12 -07:00
Ryan Moeller
a9dcfac51c
Periodically update ARC kstats
FreeBSD needs arc_adjust_zthr to run periodically for kstats to be
updated.  A comment in the code suggests this may have been the
original intent in illumos as well:

c946d5a913/module/zfs/arc.c (L4697-L4700)

Create the thread with a 1 second timer.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10371
2020-06-03 09:52:38 -07:00
Jorgen Lundman
081de9a86d
Restore avl_update() calls and related functions
The macOS kmem implementation uses avl_update() and related
functions.  These same function exist in the Solaris AVL code but
were removed because they were unused.  Restore them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10390
2020-06-03 09:49:32 -07:00
Matthew Macy
3bf3b164ee
Fix crypto build on FreeBSD HEAD
Update API usage to reflect recent change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10384
2020-05-30 12:54:57 -07:00
Jorgen Lundman
70a5fc0530
Memory leak in dsl_destroy_snapshots_nvl error case
The dsl_destroy_snapshots_nvl() function has an early error out, 
and temporary nvlists were not freed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10366
2020-05-26 16:13:41 -07:00
Brian Atkinson
fb822260b1
Gang ABD Type
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.

This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10069
2020-05-20 18:06:09 -07:00
DeHackEd
57434abae6
Use boot_ncpus in place of max_ncpus in taskq_create
Due to hotplug support or BIOS bugs sometimes max_ncpus can be
an absurdly high value. I have a system with 32 cores/threads
but reports max_ncpus == 440. This many threads potentially
cripples the system during arc_prune floods for example.

boot_ncpus is the number of working CPUs when called so use
that instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #10282
2020-05-20 10:07:21 -07:00
Ryan Moeller
7b0e39030c
freebsd: Correct the order of arguments to copyin() for Q_SETQUOTA
Sponsored by: DARPA
External-issue: https://reviews.freebsd.org/D24656
FreeBSD-commit: freebsd/freebsd@a431c095d3

Authored by: jhb <jhb@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10344
2020-05-19 16:45:25 -07:00
Kyle Evans
5b090d57d4
freebsd: return EISDIR for read(2) on directories
This is arguably a change for internal consistency within OpenZFS, as the
Linux implementation will reject read(2) on directories with EISDIR. It's
not unreasonable for read(2) to do something here on FreeBSD, but we don't
currently copy out anything useful anyways so start rejecting it with the
appropriate error.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #10338
2020-05-16 10:12:01 -07:00
Ryan Moeller
d2782af461
Fix ZVOL_DIR
We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct.

Move the definition to the file where it is needed, and define it as
/dev/zvol/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10337
2020-05-16 10:10:38 -07:00
Matthew Ahrens
1b9cd1a9d9
Fix error handling in receive_writer_thread()
If `receive_writer_thread()` gets an error from `receive_process_record()`,
it should be saved in `rwa->err` so that we will stop processing records,
and the main thread will notice that the receive has failed.

When an error is first encountered, this happens correctly.  However, if
there are more records to dequeue, the next time through the loop we
will reset `rwa->err` to zero, allowing us to try to process the
following record (2 after the failed record).  Depending on what types
of records remain, we may incorrectly complete the receive
"successfully", but without actually having processed all the records.

The fix is to only set `rwa->err` if we got a *non-zero* error.

This bug was introduced by #10099 "Improve zfs receive performance by
batching writes".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10320
2020-05-14 20:48:29 -07:00
yparitcher
cdcce2f019
Fix VN_OPEN_INVFS typo
The VN_OPEN_INVFS literal is in the wrong field.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: yparitcher <y@paritcher.com>
Closes #10322
2020-05-14 20:47:14 -07:00
Brian Behlendorf
2ade659eb4
Fix abd_enter/exit_critical wrappers
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections.  On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving.  By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.

Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10332
2020-05-14 20:45:16 -07:00
Jorgen Lundman
eeb8fae9c7
Upstream: add missing thread_exit()
Undo FreeBSD wrapper for thread_create() added to call thread_exit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10314
2020-05-14 15:58:09 -07:00
Matthew Ahrens
8b240f14f9
remove unneeded member drc_err of dmu_recv_cookie_t
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10319
2020-05-14 12:10:29 -07:00
John Poduska
41035a0496
Resilver restarts unnecessarily when it encounters errors
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.

In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.

A test has been added to verify that resilver does not restart when
relevant DTL's are available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #10291
2020-05-13 10:54:27 -07:00
Brian Atkinson
fc551d7efb
Combine OS-independent ABD Code into Common Source File
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.

The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10293
2020-05-10 12:23:52 -07:00
George Amanakis
657fd33bcf
Improvements on persistent L2ARC
Functional changes:

We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.

1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
   reflect the total aligned size of log blocks on the device. This is
   also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
   which reflect the total number of L2ARC log blocks present on cache
   devices.  It is also reflected in the header of the cache device as
   "dh_lb_count".

In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.

Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.

If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().

If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.

In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.

Non-functional changes:

Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.

Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.

Fix inaccurate comments for example in l2arc_log_blk_restore().

Add asserts at the end in l2arc_evict() and l2arc_write_buffers().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10228
2020-05-07 16:34:03 -07:00
Paul Dagnelie
108a454a46
Add support for boot environment data to be stored in the label
Modern bootloaders leverage data stored in the root filesystem to 
enable some of their powerful features. GRUB specifically has a grubenv 
file which can store large amounts of configuration data that can be 
read and written at boot time and during normal operation. This allows 
sysadmins to configure useful features like automated failover after 
failed boot attempts. Unfortunately, due to the Copy-on-Write nature 
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store 
data that allows the bootloader to make changes to the data.

This work is very similar to work that was done on Illumos to enable 
similar functionality in the FreeBSD bootloader. This patch is different 
in that the data being stored is a raw grubenv file; this file can store 
arbitrary variables and values, and the scripting provided by grub is 
powerful enough that special structures are not required to implement 
advanced behavior.

We repurpose the second padding area in each label to store the grubenv 
file, protected by an embedded checksum. We add two ioctls to get and 
set this data, and libzfs_core and libzfs functions to access them more 
easily. There are no direct command line interfaces to these functions; 
these will be added directly to the bootloader utilities.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10009
2020-05-07 09:36:33 -07:00
George Amanakis
1b664952ae
Enable splitting mirrors with indirect vdevs
When a top-level vdev is removed from a pool it is converted to an
indirect vdev. Until now splitting such mirrored pools was not possible
with zpool split. This patch enables handling of indirect vdevs and
splitting of those pools with zpool split.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10283
2020-05-06 10:32:28 -07:00
Ryan Moeller
ddc7a2dd3b
taskq: Don't leak system_delay_taskq on FreeBSD
Adds a missing taskq_destroy() call.

Reported by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10292
2020-05-05 09:36:41 -07:00
Ryan Moeller
6f3e1a4828
Avoid the GEOM topology lock recursion when autoexpanding a pool
The steps to reproduce the problem:

        mdconfig -a -t swap -s 3g -u 0
        gpart create -s GPT md0
        gpart add -t freebsd-zfs -s 1g md0
        zpool create -o autoexpand=on foo md0p1
        gpart resize -i 1 -s 2g md0

Authored by: pjd <pjd@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@bccd2db598

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Ported-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10270
2020-05-04 15:10:41 -07:00
Ryan Moeller
639dfeb831
Update FreeBSD SPL atomics
Sync up with the following changes from FreeBSD:

ZFS: add emulation of atomic_swap_64 and atomic_load_64

Some 32-bit platforms do not provide 64-bit atomic operations that ZFS
requires, either in userland or at all.  We emulate those operations
for those platforms using a mutex.  That is not entirely correct and
it's very efficient.  Besides, the loads are plain loads, so torn
values are possible.

Nevertheless, the emulation seems to work for some definition of work.

This change adds atomic_swap_64, which is already used in ZFS code,
and atomic_load_64 that can be used to prevent torn reads.

Authored by: avg <avg@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@3458e5d1e6

cleanup of illumos compatibility atomics

atomic_cas_32 is implemented using atomic_fcmpset_32 on all platforms.
Ditto for atomic_cas_64 and atomic_fcmpset_64 on platforms that have
it.  The only exception is sparc64 that provides MD atomic_cas_32 and
atomic_cas_64.
This is slightly inefficient as fcmpset reports whether the operation
updated the target and that information is not needed for cas.
Nevertheless, there is less code to maintain and to add for new
platforms.  Also, the operations are done inline now as opposed to
function calls before.

atomic_add_64_nv is implemented using atomic_fetchadd_64 on platforms
that provide it.

casptr, cas32, atomic_or_8, atomic_or_8_nv are completely removed as
they have no users.

atomic_mtx that is used to emulate 64-bit atomics on platforms that
lack them is defined only on those platforms.

As a result, platform specific opensolaris_atomic.S files have lost
most of their code.  The only exception is i386 where the
compat+contrib code provides 64-bit atomics for userland use.  That
code assumes availability of cmpxchg8b instruction.  FreeBSD does not
have that assumption for i386 userland and does not provide 64-bit
atomics.  Hopefully, this can and will be fixed.

Authored by: avg <avg@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@e9642c209b

emulate illumos membar_producer with atomic_thread_fence_rel

membar_producer is supposed to be a store-store barrier.
Also, in the code that FreeBSD has ported from illumos membar_producer
is used only with regular stores to regular memory (with respect to
caching).

We do not have an MI primitive for the store-store barrier, so
atomic_thread_fence_rel is the closest we have as it provides
(load | store) -> store barrier.

Previously, membar_producer was an empty function call on all 32-bit
arm-s, 32-bit powerpc, riscv and all mips variants.  I think that it
was inadequate.
On other platforms, such as amd64, arm64, i386, powerpc64, sparc64,
membar_producer was implemented using stronger primitives than required
for a store-store barrier with respect to regular memory access.
For example, it used sfence on amd64 and lock-ed nop in i386 (despite
TSO).
On powerpc64 we now use recommended lwsync instead of eieio.
On sparc64 FreeBSD uses TSO mode.
On arm64/aarch64 we now use dmb sy instead of dmb ish.  Not sure if
this is an improvement, actually.

After this change we can drop opensolaris_atomic.S for aarch64, amd64,
powerpc64 and sparc64 as all required atomic operations have either
direct or light-weight mapping to FreeBSD native atomic operations.

Discussed with: kib
Authored by: avg <avg@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@50cdda62fc

fix up r353340, don't assume that fcmpset has strong semantics

fcmpset can have two kinds of semantics, weak and strong.
For practical purposes, strong semantics means that if fcmpset fails
then the reported current value is always different from the expected
value.  Weak semantics means that the reported current value may be the
same as the expected value even though fcmpset failed.  That's a so
called "sporadic" failure.

I originally implemented atomic_cas expecting strong semantics, but
many platforms actually have weak one.

Reported by:    pkubaj (not confirmed if same issue)
Discussed with: kib, mjg
Authored by: avg <avg@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@238787c74e

[PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64 operations

This is a lock-based emulation of 64-bit atomics for kernel use, split off
from an earlier patch by jhibbits.

This is needed to unblock future improvements that reduce the need for
locking on 64-bit platforms by using atomic updates.

The implementation allows for future integration with userland atomic64,
but as that implies going through sysarch for every use, the current
status quo of userland doing its own locking may be for the best.

Submitted by:   jhibbits (original patch), kevans (mips bits)
Reviewed by:    jhibbits, jeff, kevans
Authored by: bdragon <bdragon@FreeBSD.org>
Differential Revision:  https://reviews.freebsd.org/D22976
FreeBSD-commit: freebsd/freebsd@db39dab3a8

Remove sparc64 kernel support

Remove all sparc64 specific files
Remove all sparc64 ifdefs
Removee indireeect sparc64 ifdefs

Authored by: imp <imp@FreeBSD.org>
FreeBSD-commit: freebsd/freebsd@48b94864c5

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Ported-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10250
2020-05-04 15:07:04 -07:00
Paul B. Henson
0aeb0bed6f OpenZFS 6765 - zfs_zaccess_delete() comments do not accurately
reflect delete permissions for ACLs

Authored by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Paul B. Henson <henson@acm.org>

Porting Notes:
* Only comments are updated

OpenZFS-issue: https://www.illumos.org/issues/6765
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/da412744bc
Closes #10266
2020-04-30 11:24:55 -07:00
Paul B. Henson
235a856576 OpenZFS 6762 - POSIX write should imply DELETE_CHILD on directories
- and some additional considerations

Authored by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/6762
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1eb4e906ec
Closes #10266
2020-04-30 11:24:38 -07:00
Paul B. Henson
99495ba6ab OpenZFS 8984 - fix for 6764 breaks ACL inheritance
Authored by: Dominik Hassler <hadfl@omniosce.org>
Reviewed by: Sam Zaydel <szaydel@racktopsystems.com>
Reviewed by: Paul B. Henson <henson@acm.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/8984
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e9bacc6d1a
Closes #10266
2020-04-30 11:24:27 -07:00
Paul B. Henson
5a2f527d4b OpenZFS 6764 - zfs issues with inheritance flags during chmod(2)
with aclmode=passthrough

Authored by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/6764
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de0f1ddb59
Closes #10266
2020-04-30 11:24:14 -07:00
Paul B. Henson
7bf3e1fa0f OpenZFS 3254 - add support in zfs for aclmode=restricted
Authored-by: Paul B. Henson <henson@acm.org>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/3254
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/71dbfc287c
Closes #10266
2020-04-30 11:23:59 -07:00
Paul B. Henson
a1af567bb6 OpenZFS 742 - Resurrect the ZFS "aclmode" property OpenZFS 664 - Umask masking "deny" ACL entries OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics
Porting notes:
* Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter
  rather than 'zfsvfs_t *zfsvfs'
* zfs man pages changes mixed between zfs and new zfsprops man pages

Reviewed by: Aram Hvrneanu <aram@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Robert Gordon <rbg@openrbg.com>
Reviewed by: Mark.Maybee@oracle.com
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@nexenta.com>
Ported-by: Paul B. Henson <henson@acm.org>

OpenZFS-issue: https://www.illumos.org/issues/742
OpenZFS-issue: https://www.illumos.org/issues/664
OpenZFS-issue: https://www.illumos.org/issues/279
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110
Closes #10266
2020-04-30 11:22:45 -07:00
Brian Behlendorf
ab16c87e55
Add longjmp support for Thumb-2
When a Thumb-2 kernel is being used, then longjmp must be implemented
using the Thumb-2 instruction set in module/lua/setjmp/setjmp_arm.S.

Original-patch-by: @jsrlabs
Reviewed-by: @awehrfritz
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7408 
Closes #9957 
Closes #9967
2020-04-29 17:30:13 -07:00
George Amanakis
fa25460538
Add missing zfs_refcount_destroy() in key_mapping_rele()
Otherwise when running with reference_tracking_enable=TRUE mounting
and unmounting an encrypted dataset panics with:

Call Trace:
 dump_stack+0x66/0x90
 slab_err+0xcd/0xf2
 ? __kmalloc+0x174/0x260
 ? __kmem_cache_shutdown+0x158/0x240
 __kmem_cache_shutdown.cold+0x1d/0x115
 shutdown_cache+0x11/0x140
 kmem_cache_destroy+0x210/0x230
 spl_kmem_cache_destroy+0x122/0x3e0 [spl]
 zfs_refcount_fini+0x11/0x20 [zfs]
 spa_fini+0x4b/0x120 [zfs]
 zfs_kmod_fini+0x6b/0xa0 [zfs]
 _fini+0xa/0x68c [zfs]
 __x64_sys_delete_module+0x19c/0x2b0
 do_syscall_64+0x5b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10246
2020-04-28 09:53:45 -07:00
Ryan Moeller
a8085184d6
Fix zlib leak on FreeBSD
zlib_inflateEnd was accidentally a wrapper for inflateInit instead of
inflateEnd, and hilarity ensues.

Fix the typo so we free memory instead of allocating more.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10225
Closes #10252
2020-04-28 09:14:30 -07:00
Tom Caputi
aa646323db
Fix missing ivset guid with resumed raw base recv
This patch corrects a bug introduced in 61152d1069. When
resuming a raw base receive, the dmu_recv code always sets
drc->drc_fromsnapobj to the object ID of the previous
snapshot. For incrementals, this is correct, but for base
sends, this should be left at 0. The presence of this ID
eventually allows a check to run which determines whether
or not the incoming stream and the previous snapshot have
matching IVset guids. This check fails becuase it is not
meant to run when there is no previous snapshot. When it
does fail, the user receives an error stating that the
incoming stream has the problem outlined in errata 4.

This patch corrects this issue by simply ensuring
drc->drc_fromsnapobj is left as 0 for base receives.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #10234 
Closes #10239
2020-04-24 19:00:32 -07:00
Matthew Ahrens
196bee4cfd
Remove deduplicated send/receive code
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
2020-04-23 10:06:57 -07:00
Matthew Ahrens
32d805c3e2
Use a struct to organize metaslab-group-allocator fields
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators".  Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators.  In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.

This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t.  The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10213
2020-04-22 10:26:56 -07:00
Matthew Ahrens
1f043c8be1
Fix zfs send progress reporting
The progress of a send is supposed to be reported by `zfs send -v`, but
it is not.  This works by creating a new user thread (with
pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check how
much progress has been made.  This IOCTL finds the specified send (since
there may be multiple concurrent sends in the system).  The IOCTL also
checks that the specified send was started by the current process.

On Linux, different threads of the same process are represented as
different `struct task_struct`s (and, confusingly, have different
PID's).  To check if if two threads are in the same process, we need to
check if they have the same `struct task_struct:group_leader`.

We used to to this correctly, but it was inadvertently changed by
30af21b025 (Redacted Send) to simply check if the current
`struct task_struct` is the one that started the send.

This commit changes the code back to checking if the send was started by
a `struct task_struct` with the same `group_leader` as the calling
thread.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10215 
Closes #10216
2020-04-20 10:12:48 -07:00
Matthew Macy
c614fd6e12
Use new FreeBSD API to largely eliminate object locking
Propagate changes in HEAD that mostly eliminate object locking.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10205
2020-04-17 09:30:26 -07:00
George Amanakis
9249f1272e
Persistent L2ARC minor fixes
Minor fixes on persistent L2ARC improving code readability and fixing 
a typo in zdb.c when byte-swapping a log block. It also improves the 
pesist_l2arc_007_pos.ksh test by giving it more time to retrieve log 
blocks on the cache device.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10210
2020-04-17 09:27:40 -07:00
Ryan Moeller
a7929f3137
Update FreeBSD tunables
Remove some obsolete legacy compat, rename some misnamed, and add some
missing tunables for FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10203
2020-04-15 11:14:47 -07:00
Matthew Macy
9f0a21e641
Add FreeBSD support to OpenZFS
Add the FreeBSD platform code to the OpenZFS repository.  As of this
commit the source can be compiled and tested on FreeBSD 11 and 12.
Subsequent commits are now required to compile on FreeBSD and Linux.
Additionally, they must pass the ZFS Test Suite on FreeBSD which is
being run by the CI.  As of this commit 1230 tests pass on FreeBSD
and there are no unexpected failures.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #898 
Closes #8987
2020-04-14 11:36:28 -07:00
Brian Behlendorf
791e480c6a
Disable user space reference tracking
The memory and cpu cost of reference count tracking with the current
implementation is significant.  For this reason it has always been
disabled by default for the kmods.  Apply this same default to user
space so ztest doesn't always incur this performance penalty.

Our intention is to re-enable this by default for ztest once the code
has been optimized.  Since we expect to at some point provide a FUSE
implementation we wouldn't want this enabled by default for libzpool.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10189
2020-04-13 10:51:44 -07:00
Matthew Ahrens
20f287855a
zvol_write() can use dmu_tx_hold_write_by_dnode()
We can improve the performance of writes to zvols by using
dmu_tx_hold_write_by_dnode() instead of dmu_tx_hold_write().  This
reduces lock contention on the first block of the dnode object, and also
reduces the amount of CPU needed.  The benefit will be highest with
multi-threaded async writes (i.e. writes that don't call zil_commit()).

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10184
2020-04-10 21:14:01 -07:00
George Amanakis
77f6826b83
Persistent L2ARC
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #925 
Closes #1823 
Closes #2672 
Closes #3744 
Closes #9582
2020-04-10 10:33:35 -07:00
Ryan Moeller
36a6e2335c
Don't ignore zfs_arc_max below allmem/32
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.

Add warning messages when tunables are being ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10157
Closes #10158
2020-04-09 15:39:48 -07:00
Matthew Macy
8b27e08ed8
Add separate field for indicating that spa is in middle of split
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10178
2020-04-09 09:59:31 -07:00
Brian Behlendorf
68dde63d13
Linux 5.7 compat: blk_alloc_queue()
Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified
the blk_alloc_queue() interface by updating it to take the request
queue as an argument.  Add a wrapper function which accepts the new
arguments and internally uses the available interfaces.

Other minor changes include increasing the Linux-Maximum to 5.6 now
that 5.6 has been released.  It was not bumped to 5.7 because this
release has not yet been finalized and is still subject to change.

Added local 'struct zvol_state_os *zso' variable to zvol_alloc.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10181 
Closes #10187
2020-04-09 09:16:46 -07:00
Matthew Macy
01c4f2bf29
Use vn_io_fault_uiomove on FreeBSD to avoid potential deadlock
Added to prevent a possible deadlock, the following comments from
FreeBSD explain the issue.  The comment describing vn_io_fault_uiomove:

/*
 * Helper function to perform the requested uiomove operation using
 * the held pages for io->uio_iov[0].iov_base buffer instead of
 * copyin/copyout.  Access to the pages with uiomove_fromphys()
 * instead of iov_base prevents page faults that could occur due to
 * pmap_collect() invalidating the mapping created by
 * vm_fault_quick_hold_pages(), or pageout daemon, page laundry or
 * object cleanup revoking the write access from page mappings.
 *
 * Filesystems specified MNTK_NO_IOPF shall use vn_io_fault_uiomove()
 * instead of plain uiomove().
 */

This used for vn_io_fault which has the following motivation:

/*
 * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
 * prevent the following deadlock:
 *
 * Assume that the thread A reads from the vnode vp1 into userspace
 * buffer buf1 backed by the pages of vnode vp2.  If a page in buf1 is
 * currently not resident, then system ends up with the call chain
 *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
 *     vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
 * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
 * If, at the same time, thread B reads from vnode vp2 into buffer buf2
 * backed by the pages of vnode vp1, and some page in buf2 is not
 * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
 *
 * To prevent the lock order reversal and deadlock, vn_io_fault() does
 * not allow page faults to happen during VOP_READ() or VOP_WRITE().
 * Instead, it first tries to do the whole range i/o with pagefaults
 * disabled. If all pages in the i/o buffer are resident and mapped,
 * VOP will succeed (ignoring the genuine filesystem errors).
 * Otherwise, we get back EFAULT, and vn_io_fault() falls back to do
 * i/o in chunks, with all pages in the chunk prefaulted and held
 * using vm_fault_quick_hold_pages().
 *
 * Filesystems using this deadlock avoidance scheme should use the
 * array of the held pages from uio, saved in the curthread->td_ma,
 * instead of doing uiomove().  A helper function
 * vn_io_fault_uiomove() converts uiomove request into
 * uiomove_fromphys() over td_ma array.
 *
 * Since vnode locks do not cover the whole i/o anymore, rangelocks
 * make the current i/o request atomic with respect to other i/os and
 * truncations.
 */

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10177
2020-04-08 10:30:27 -07:00
Ryan Moeller
7e3df9db12
Finish refactoring for ZFS_MODULE_PARAM_CALL
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.

To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.

With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10179
2020-04-07 10:06:22 -07:00
Paul Dagnelie
5a42ef04fd
Add 'zfs wait' command
Add a mechanism to wait for delete queue to drain.

When doing redacted send/recv, many workflows involve deleting files 
that contain sensitive data. Because of the way zfs handles file 
deletions, snapshots taken quickly after a rm operation can sometimes 
still contain the file in question, especially if the file is very 
large. This can result in issues for redacted send/recv users who 
expect the deleted files to be redacted in the send streams, and not 
appear in their clones.

This change duplicates much of the zpool wait related logic into a 
zfs wait command, which can be used to wait until the internal
deleteq has been drained.  Additional wait activities may be added 
in the future. 

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9707
2020-04-01 10:02:06 -07:00
Fabio Scaccabarozzi
c9e3efdb3a
Bugfix/fix uio partial copies
In zfs_write(), the loop continues to the next iteration without
accounting for partial copies occurring in uiomove_iov when 
copy_from_user/__copy_from_user_inatomic return a non-zero status.
This results in "zfs: accessing past end of object..." in the 
kernel log, and the write failing.

Account for partial copies and update uio struct before returning
EFAULT, leave a comment explaining the reason why this is done.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: ilbsmart <wgqimut@gmail.com>
Signed-off-by: Fabio Scaccabarozzi <fsvm88@gmail.com>
Closes #8673 
Closes #10148
2020-04-01 09:48:54 -07:00
Matthew Ahrens
0929c4de39
Improve ZVOL sync write performance by using a taskq
== Summary ==

Prior to this change, sync writes to a zvol are processed serially.
This commit makes zvols process concurrently outstanding sync writes in
parallel, similar to how reads and async writes are already handled.
The result is that the throughput of sync writes is tripled.

== Background ==

When a write comes in for a zvol (e.g. over iscsi), it is processed by
calling `zvol_request()` to initiate the operation.  ZFS is expected to
later call `BIO_END_IO()` when the operation completes (possibly from a
different thread).  There are a limited number of threads that are
available to call `zvol_request()` - one one per iscsi client (unless
using MC/S).  Therefore, to ensure good performance, the latency of
`zvol_request()` is important, so that many i/o operations to the zvol
can be processed concurrently.  In other words, if the client has
multiple outstanding requests to the zvol, the zvol should have multiple
outstanding requests to the storage hardware (i.e. issue multiple
concurrent `zio_t`'s).

For reads, and async writes (i.e. writes which can be acknowledged
before the data reaches stable storage), `zvol_request()` achieves low
latency by dispatching the bulk of the work (including waiting for i/o
to disk) to a taskq.  The taskq callback (`zvol_read()` or
`zvol_write()`) blocks while waiting for the i/o to disk to complete.
The `zvol_taskq` has 32 threads (by default), so we can have up to 32
concurrent i/os to disk in service of requests to zvols.

However, for sync writes (i.e. writes which must be persisted to stable
storage before they can be acknowledged, by calling `zil_commit()`),
`zvol_request()` does not use `zvol_taskq`.  Instead it blocks while
waiting for the ZIL write to disk to complete.  This has the effect of
serializing sync writes to each zvol.  In other words, each zvol will
only process one sync write at a time, waiting for it to be written to
the ZIL before accepting the next request.

The same issue applies to FLUSH operations, for which `zvol_request()`
calls `zil_commit()` directly.

== Description of change ==

This commit changes `zvol_request()` to use
`taskq_dispatch_ent(zvol_taskq)` for sync writes, and FLUSh operations.
Therefore we can have up to 32 threads (the taskq threads)
simultaneously calling `zil_commit()`, for a theoretical performance
improvement of up to 32x.

To avoid the locking issue described in the comment (which this commit
removes), we acquire the rangelock from the taskq callback (e.g.
`zvol_write()`) rather than from `zvol_request()`.  This applies to all
writes (sync and async), reads, and discard operations.  This means that
multiple simultaneously-outstanding i/o's which access the same block
can complete in any order.  This was previously thought to be incorrect,
but a review of the block device interface requirements revealed that
this is fine - the order is inherently not defined.  The shorter hold
time of the rangelock should also have a slight performance improvement.

For an additional slight performance improvement, we use
`taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a
`kmem_alloc()` and eliminates a failure mode.  This applies to all
writes (sync and async), reads, and discard operations.

== Performance results ==

We used a zvol as an iscsi target (server) for a Windows initiator
(client), with a single connection (the default - i.e. not MC/S).

We used `diskspd` to generate a workload with 4 threads, doing 1MB
writes to random offsets in the zvol.  Without this change we get
231MB/s, and with the change we get 728MB/s, which is 3.15x the original
performance.

We ran a real-world workload, restoring a MSSQL database, and saw
throughput 2.5x the original.

We saw more modest performance wins (typically 1.5x-2x) when using MC/S
with 4 connections, and with different number of client threads (1, 8,
32).

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10163
2020-03-31 10:50:44 -07:00
George Amanakis
37c22948e5
Reset l2ad_hand and l2ad_first in l2arc_evict
Increasing l2arc_write_size or l2arc_write_boost can result in
l2arc_write_buffers() not having enough space to perform its writes and
panic zio_write_phys().

Instead of resetting l2ad_hand to l2ad_start at the end of
l2arc_write_buffers() and not taking into account a possible
user-mediated increase of l2arc_write_max, we do this in l2arc_evict(),
right after l2arc_write_size() has run. If there is not enough space to
evict (ie we will exceed l2ad_end) we evict to the end of the device,
reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate
l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making
sure in l2arc_write_size() that l2ad_start + size does not exceed
l2ad_end.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10154
2020-03-31 10:46:48 -07:00
Ryan Moeller
9a51738b60
Let default arc_c_max be platform dependent
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator.  Other platforms
do not require the same logic.

Implement an arc_default_max() function to determine a default max ARC
size in platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10155
2020-03-27 09:14:46 -07:00
Matthew Ahrens
3f38797338
Compile cityhash code into libzfs
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10152
2020-03-27 09:11:22 -07:00
Dirkjan Bussink
112c1bff94
Remove checks for null out value in encryption paths
These paths are never exercised, as the parameters given are always
different cipher and plaintext `crypto_data_t` pointers.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fueloep <attila@fueloep.org>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes #9661 
Closes #10015
2020-03-26 10:41:57 -07:00
Brian Behlendorf
5351951274
Fix zfs_rmnode() unlink / rollback issue
If a has rollback has occurred while a file is open and unlinked.
Then when the file is closed post rollback it will not exist in the
rolled back version of the unlinked object.  Therefore, the call to
zap_remove_int() may correctly return ENOENT and should be allowed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6812 
Closes #9739
2020-03-18 11:47:07 -07:00
Paul Dagnelie
7145123b0a
Separate warning for incomplete and corrupt streams
This change adds a separate return code to zfs_ioc_recv that is used 
for incomplete streams, in addition to the existing return code for 
streams that contain corruption.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10122
2020-03-17 10:30:33 -07:00
Attila Fülöp
5b3b79559c
ICP: gcm-avx: Support architectures lacking the MOVBE instruction
There are a couple of x86_64 architectures which support all needed
features to make the accelerated GCM implementation work but the
MOVBE instruction. Those are mainly Intel Sandy- and Ivy-Bridge
and AMD Bulldozer, Piledriver, and Steamroller.

By using MOVBE only if available and replacing it with a MOV
followed by a BSWAP if not, those architectures now benefit from
the new GCM routines and performance is considerably better
compared to the original implementation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Followup #9749 
Closes #10029
2020-03-17 10:24:38 -07:00
Matthew Ahrens
7261fc2e81
Improve zfs receive performance by batching writes
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object.  If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant.  For example, in some
workloads the `receieve_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all.  By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10099
2020-03-16 11:51:56 -07:00
Matthew Ahrens
0fdd6106bb
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock.  For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it.  This may lead to deadlock, since the lock order is
reversed.

Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock.  However, there are a few places in
the send and receive code paths that do not.  For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.

This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock.  In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t).  Thus we do not need to acquire the dp_config_rwlock in
many new places.

I also cleaned up code in dmu_redact_snap() and send_traverse_thread().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 10:55:02 -07:00
Alexander Motin
fa130e010c
Fix infinite scan on a pool with only special allocations
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.

Addition of special and just in case dedup classes fixes the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10106 
Closes #8694
2020-03-12 10:52:03 -07:00
John Poduska
e6b28efccc
Prevent race condition in dnode_dest (#10101)
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:

	refs = zfs_refcount_remove(&dn->dn_holds, tag);
	mutex_exit(&dn->dn_mtx);

So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.

This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #7814
Closes #10101
2020-03-12 10:25:56 -07:00
Mark Roper
1e9231ada8
Prevent deadlock in arc_read in Linux memory reclaim callback
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes #9987
2020-03-12 10:24:43 -07:00
Matthew Ahrens
1dc32a67e9
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads.  This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks.  Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.

The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread.  It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`.  This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU.  It also induces later
costs by other threads:

 * Since the data was only prefetched, dmu_send()->dmu_dump_write() will
   need to call arc_read() again to get the data.  This will have to
   look up the arc_hdr in the hash table and copy the data from the
   scatter ABD in the arc_hdr to a linear ABD in arc_buf.  This takes
   27% of one CPU.

 * dmu_dump_write() needs to arc_buf_destroy()  This takes 11% of one
   CPU.

 * arc_adjust() will need to evict this arc_hdr, taking about 50% of one
   CPU.

All of these costs can be avoided by bypassing the ARC if the data is
not already cached.  This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.

The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second.  This change
increases the metric by 50%, from ~100,000 to ~150,000.  When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).

In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache.  In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 10:51:04 -07:00
Ryan Moeller
f5f6fb03b7
Change default to overlay=on
Filesystems allow overlay mounts by default on FreeBSD and Linux.

Respect the native convention by switching the default to overlay=on,
while retaining the option to turn the property off for compatibility
with other operating systems' conventions.

Update documentation and tests accordingly.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10030
2020-03-06 09:28:19 -08:00
Brian Behlendorf
f49db9b504
zio: dprintf_bp() if errors > 0 in zfs_blkptr_verify()
Also dprintf_bp() in case BLK_VERIFY_HALT of zfs_blkptr_verify_log()
since dprintf_bp() in zfs_blkptr_verify() will never be executed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Justin Keogh <commits@v6y.net>
Closes #10086
2020-03-04 15:08:41 -08:00
Brian Behlendorf
2288d41968
Add trim support to zpool wait
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #10071
2020-03-04 15:07:11 -08:00
Matthew Ahrens
b3212d2fa6
Improve performance of zio_taskq_member
__zio_execute() calls zio_taskq_member() to determine if we are running
in a zio interrupt taskq, in which case we may need to switch to
processing this zio in a zio issue taskq.  The call to
zio_taskq_member() can become a performance bottleneck when we are
processing a high rate of zio's.

zio_taskq_member() calls taskq_member() on each of the zio interrupt
taskqs, of which there are 21.  This is slow because each call to
taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
slow.

This commit improves the performance of zio_taskq_member() by having it
cache the value of tsd_get(taskq_tsd), reducing the number of those
calls to 1/21th of the current behavior.

In a test case running `zfs send -c >/dev/null` of a filesystem with
small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of
one CPU, and with this change it is reduced to 1.3%.  Overall time to
perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000
blocks/sec).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10070
2020-03-03 10:29:38 -08:00
Ryan Moeller
9bb907bc3f
Make spa_history_zone platform-dependent in kernel
This function should only return "linux" on Linux.

Move the kernel part of the function out of common code.
Fix the tests for FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10079
2020-03-02 09:43:30 -08:00
Matthew Macy
cf118ae8dc
Don't call zrele on passed zp in zfs_xattr_owner_unlinked on FreeBSD
FreeBSD has a somewhat more cumbersome locking and refcounting
protocol for the platform counterpart to znode. We need to not call
zrele on the passed zp, but do need to do so on any intermediate zp.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10075
2020-02-28 14:53:18 -08:00
Matthew Macy
ae9f92f6f3
Re-share zfsdev_getminor and zfs_onexit_fd_hold
By adding a zfs_file_private accessor to the common
interfaces and some extensions to FreeBSD platform
code it is now possible to share the implementations
for the aforementioned functions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10073
2020-02-28 14:50:32 -08:00
Matthew Ahrens
9cdf7b1f6b
Improve zfs destroy performance with zio_t-free zio_free()
When "zfs destroy" is run, it completes quickly, and in the background
we locate the blocks to free and free them.  This background activity
can be observed with `zpool get freeing` and `zpool wait -t free ...`.

This background activity is processed by a single thread (the spa_sync
thread) which calls zio_free() on each of the blocks to free.  With even
modest storage performance, the CPU consumption of zio_free() can be the
performance bottleneck.

Performance of zio_free() can be improved by not actually creating a
zio_t in the common case (non-dedup, non-gang), instead calling
metaslab_free() directly.  This avoids the CPU cost of allocating the
zio_t, and more importantly the cost of adding and later removing this
zio_t from the parent zio's child list.

The result is that performance of background freeing more than doubles,
from 0.6 million blocks per second to 1.3 million blocks per second.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10034
2020-02-28 14:49:44 -08:00
Brian Behlendorf
bd0d24e09b
Linux 5.5 compat: blkg_tryget()
Commit https://github.com/torvalds/linux/commit/9e8d42a0f accidentally
converted the static inline function blkg_tryget() to GPL-only for
kernels built with CONFIG_PREEMPT_RCU=y and CONFIG_BLK_CGROUP=y.

Resolve the build issue by providing our own equivalent functionality
when needed which uses rcu_read_lock_sched() internally as before.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9745
Closes #10072
2020-02-28 08:58:39 -08:00
Matthew Macy
13fac09868
Consolidate arc_buf allocation checks
The following check currently occurs in three separate locations
in dbuf.c.  This change consolidates those checks in to the
dbuf_alloc_arcbuf_from_arcbuf() function.

if (arc_is_encrypted(data)) {
...
} else if (compress_type != ZIO_COMPRESS_OFF) {
...
} else {
...
}

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10057
2020-02-27 17:12:44 -08:00
Brian Behlendorf
2c3a83701d Linux 5.6 compat: time_t
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired.  Callers are now required to use the time64_t type.

Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code.  They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.

Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.

Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10052
Closes #10064
2020-02-27 09:31:02 -08:00
Matthew Macy
28caa74b19
Refactor dnode dirty context from dbuf_dirty
* Add dedicated donde_set_dirtyctx routine.
* Add empty dirty record on destroy assertion.
* Make much more extensive use of the SET_ERROR macro.

Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9924
2020-02-26 16:09:17 -08:00
Matthew Macy
c6a6b4d50a
Remove dead code error handling from dsl_crypt.c
Sleepable (KM_SLEEP) allocations cannot fail. Hence
error handling for them is not useful.

Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10031
2020-02-25 15:59:29 -08:00
Dirkjan Bussink
327000ce04
Remove zfs_getattr and convoff dead code
The `convoff` function is called only in one code path in `zfs_space`.
Each caller of `zfs_space` is called with a `flock64_t` that has
`l_whence` set to `SEEK_SET`. This means that `convoff` always results
in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and
`int whence` is `SEEK_SET` as well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes #10006
2020-02-24 15:38:22 -08:00
Matthew Ahrens
31a69fbccb
Remove unused structs and members in dmu_send.c
There are several structs (and members of structs) related to redaction,
which are no longer used.  This commit removes them.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10039
2020-02-24 09:50:14 -08:00
Arvind Sankar
65635c3403
Fix icp include directories for in-tree build
When zfs is built in-tree using --enable-linux-builtin, the compile
commands are executed from the kernel build directory. If the build
directory is different from the kernel source directory, passing
-Ifs/zfs/icp will not find the headers as they are not present in the
build directory.

Fix this by adding @abs_top_srcdir@ to pull the headers from the zfs
source tree instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10021
2020-02-20 08:10:47 -08:00
Ryan Moeller
5f087dda78
Enable zpool events tunables and tests on FreeBSD
We have have made the necessary changes in our module code to expose
zevents through both devd and the zpool events ioctl. Now the tunables
can be exposed and zpool events tests can be enabled on both platforms.

A few minor tweaks to the tests were needed to accommodate the way wc
formats output on FreeBSD.

zed remains to be ported.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10008
2020-02-18 11:22:56 -08:00
Matthew Macy
8b3547a481
Factor out some dbuf subroutines and add state change tracing
Create dedicated dbuf_read_hole and dbuf_read_bonus.
Additionally, add a dtrace probe to allow state change tracing.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Authored-by: Will Andrews <wca@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9923
2020-02-18 11:21:37 -08:00
Richard Laager
f244846462
Prefer org.openzfs for features and properties
Moving forward, we wish to use org.openzfs (no dash) rather than
org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
The existing feature GUIDs cannot be changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #10003
2020-02-18 09:36:50 -08:00
DeHackEd
d09dc5980c
Honour sync=disabled when relinking tpmfiles
Unlinked files don't respect synchronous flush commands, but when they get relinked
their state is unknown. Previously we force flushed all such files even when
sync=disabled. Correct this case.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #10005
2020-02-16 12:44:08 -08:00
Jason King
13b5a4d5c0
Support setting user properties in a channel program
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9950
2020-02-14 13:41:42 -08:00
Matthew Ahrens
4fe3a842bb
Remove limit on number of async zio_frees of non-dedup blocks
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG.  This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.

zfs_async_block_max_blocks is set to 100,000 by default.  For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy").  For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.

This change increases zfs_async_block_max_blocks to be unlimited by
default.  To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg.  The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10000
2020-02-14 08:39:46 -08:00
Matthew Ahrens
2adc6b35ae
Missed wakeup when growing kmem cache
When growing the size of a (VMEM or KVMEM) kmem cache, spl_cache_grow()
always does taskq_dispatch(spl_cache_grow_work), and then waits for the
KMC_BIT_GROWING to be cleared by the taskq thread.

The taskq thread (spl_cache_grow_work()) does:
1. allocate new slab and add to list
2. wake_up_all(skc_waitq)
3. clear_bit(KMC_BIT_GROWING)

Therefore, the waiting thread can wake up before GROWING has been
cleared.  It will see that the growing has not yet completed, and go
back to sleep until it hits the 100ms timeout.

This can have an extreme performance impact on workloads that alloc/free
more than fits in the (statically-sized) magazines.  These workloads
allocate and free slabs with high frequency.

The problem can be observed with `funclatency spl_cache_grow`, which on
some workloads shows that 99.5% of the time it takes <64us to allocate
slabs, but we spend ~70% of our time in outliers, waiting for the 100ms
timeout.

The fix is to do `clear_bit(KMC_BIT_GROWING)` before
`wake_up_all(skc_waitq)`.

A future investigation should evaluate if we still actually need to
taskq_dispatch() at all, and if so on which kernel versions.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9989
2020-02-13 11:23:02 -08:00
Alexander Motin
465e4e795e
Remove duplicate dbufs accounting
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.

According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9949
2020-02-13 11:20:42 -08:00
Christian Schwarz
948f0c4419 zcp: add zfs.sync.bookmark
Add support for bookmark creation and cloning.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:17 -08:00
Christian Schwarz
a73f361fdb Implement bookmark copying
This feature allows copying existing bookmarks using

    zfs bookmark fs#target fs#newbookmark

There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.

Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).

ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.

Overview:

- Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
  - create new bookmark node
  - copy source's `zbn_phys` to new's `zbn_phys`
  - zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
  bookmarks as sources
  - => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
  and bookmark sources
- Adjust CLI
  - refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
  - warn about redaction bookmark handling
- Add test cases
  - CLI
  - pyyzfs libzfs_core bindings

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:12 -08:00
Matthew Macy
7b49bbc816
Address Coverity warnings in #9902
Coverity reports the variable may be NULL, but due to the
way the dirty records are handled this cannot be the case.
Add a comment and VERIFY to make this clear and silence
the warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9962
2020-02-11 13:12:41 -08:00
Brian Behlendorf
dceeca5bbd
Add missing dmu_buf_unlock_parent() calls to dbuf_read_impl()
As explained by the comment in dbuf_read() and above dbuf_read_impl().
Under all circumstances the parent lock specified by dblt should be
dropped when existing dbuf_read_impl().  This was not being done for
two exist paths.  Additionally, ensure the mutex is unlocked before
dropping the parent lock.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9968
2020-02-10 14:54:12 -08:00
Paul Zuchowski
bc67cba7c0
Fix zdb -R with 'b' flag
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display.  Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9640
Closes #9729
2020-02-10 14:00:05 -08:00
Ryan Moeller
57940b435c
Share some code for spa deadman tunables
We need to do the same thing to update all spas on any OS for these
tunables, so let's share the code.

While here let's match the types of the literals initializing the
variables with the type of the variable.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9964
2020-02-10 13:11:30 -08:00
Attila Fülöp
31b160f0a6
ICP: Improve AES-GCM performance
Currently SIMD accelerated AES-GCM performance is limited by two
factors:

a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.

To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.

Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9749
2020-02-10 12:59:50 -08:00
Matthew Macy
fa3922df75
Factor out dbuf_sync_bonus
Factor the portion of dbuf_sync_leaf() responsible for handling bonus
buffers out in to its own dbuf_sync_bonus() helper function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9909
2020-02-07 14:22:29 -08:00
Brian Behlendorf
795699a6cc Linux 5.6 compat: timestamp_truncate()
The timestamp_truncate() function was added, it replaces the existing
timespec64_trunc() function.  This change renames our wrapper function
to be consistent with the upstream name and updates the compatibility
code for older kernels accordingly.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9956
Closes #9961
2020-02-07 11:04:32 -08:00
Brian Behlendorf
0dd7364853 Linux 5.6 compat: struct proc_ops
The proc_ops structure was introduced to replace the use of of the
file_operations structure when registering proc handlers.  This
change creates a new kstat_proc_op_t typedef for compatibility
which can be used to pass around the correct structure.

This change additionally adds the 'const' keyword to all of the
existing proc operations structures.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9961
2020-02-07 11:03:53 -08:00
Alexander Motin
e0ce98d57c
Reduce number of atomic_add() calls in aggsum
Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to
re-borrow after the flush.  But since asc_borrowed and asc_delta are
accessed only while holding asc_lock, it makes no any sense to modify
as_lower_bound and as_upper_bound in multiple steps.  Instead of that
the new code uses only 2 atomics in all the cases, one per as_*_bound
variable.  I think even that is overkill, simple atomic store and
load could be used here, since all modifications are done under the
as_lock, but there are no such primitives in ZFS code now.

While there, make borrow code consider previous borrow value, so that
on mixed request patterns reduce chance of needing to borrow again if
much larger request follows tiny one that needed borrow.

Also reduce as_numbuckets from uint64_t to u_int.  It makes no sense
to use so large division operation on every aggsum_add().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9930
2020-02-06 13:21:06 -08:00
Romain Dolbeau
77122f9d68
Replace static per-cpu with dynamic per-cpu data
This solves the issue of loading the spl module on RISC-V.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9942
2020-02-06 09:26:13 -08:00
Alexander Motin
cbd8f5b759
Few microoptimizations to dbuf layer
Move db_link into the same cache line as db_blkid and db_level.
It allows significantly reduce avl_add() time in dbuf_create() on
systems with large RAM and huge number of dbufs per dnode.

Avoid few accesses to dbuf_caches[].size, which is highly congested
under high IOPS and never stays in cache for a long time.  Use local
value we are receiving from zfs_refcount_add_many() any way.

Remove cache_size_bytes_max bump from dbuf_evict_one().  I don't see
a point to do it on dbuf eviction after we done it on insertion in
dbuf_rele_and_unlock().

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9931
2020-02-05 11:08:44 -08:00
Matthew Macy
cccbed9f98
Convert dbuf dirty record record list to a list_t
Additionally pull in state machine comments about
upcoming async cow work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9902
2020-02-05 11:07:19 -08:00
Alexander Motin
741db5a346
Prepare ks_data before calling kstat_install()
It violated sequence described in kstat.h, and at least on FreeBSD
kstat_install() uses provided names to create the sysctls.  If the
names are not available at the time, it ends up bad.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9933
2020-02-04 08:49:12 -08:00
Ryan Moeller
8c4987c489
Restore aclmode and remove acltype on FreeBSD
This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE,
matching what is done in the NFSv4 ACLs PR (#9709).

On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide
ZFS_PROP_ACLMODE.

The tests already assume this arrangement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9913
2020-02-04 08:40:07 -08:00
Ryan Moeller
07bc2bc231
Fix const-correctness in raidz math
Clang warns (errors) that "cast from 'const void *' to 'struct v *'
drops const qualifier."

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9917
2020-02-03 10:52:41 -08:00
Matthew Ahrens
ec21397127
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 09:33:14 -08:00
Ryan Moeller
fe7c15985b
Left-align index props
Index type props display as strings, which should be aligned to the
left not to the right.

Before:
```
FreeBSD-13_0-CURRENT-r356528 ➜  ~ zfs list -ro name,aclmode,mountpoint
NAME        ACLMODE  MOUNTPOINT
p0      passthrough  /p0
p0/foo      discard  /p0/foo
```

After:
```
FreeBSD-13_0-CURRENT-r356528 ➜  ~ zfs list -ro name,aclmode,mountpoint
NAME    ACLMODE      MOUNTPOINT
p0      passthrough  /p0
p0/foo  discard      /p0/foo
```

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9912
2020-01-31 08:55:51 -08:00
Christian Schwarz
20ea8540a6 dsl_bookmark_create_check: fix NULL pointer deref if dbca_errors == NULL
Discovered in preparation of zcp support for creating bookmarks.
Handle the case where dbca_errors is NULL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9880
2020-01-23 21:13:42 -08:00
Christian Schwarz
3aea3c9d54 entity_namecheck: doc comment: include space as allowed character
The helper function valid_char already allows it but
the doc comment was out of date.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9879
2020-01-23 21:11:54 -08:00
Romain Dolbeau
35b07497c6 Add AltiVec RAID-Z
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.

Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9539
2020-01-23 11:01:24 -08:00
Christian Schwarz
0ea03c7c82 dmu_send: redacted: fix memory leak on invalid redaction/from bookmark
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9867
2020-01-23 09:34:31 -08:00
Matthew Macy
3d91490f7c Simplify FreeBSD's locking requirements in zfs_replay.c
Now that the FreeBSD zfs_vnops code avoids asserting that
a vnode lock is held when z_replay is true we can limit
the FreeBSD specific changes to the couple of changes
where it is necessary to drop the vnode locks because
a function returns with it held.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9865
2020-01-22 17:55:56 -08:00
Jason King
e2ef1cbf04 Support inheriting properties in channel programs
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9738
2020-01-22 17:03:17 -08:00
Matthew Macy
af26a86958 Update tunable macro usage for disable_ivset_guid_check
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9861
2020-01-21 15:05:23 -08:00
Matthew Macy
d3c1e45b7a Re-consolidate zio_delay_interrupt
With recent SPL changes there is no longer any need for a per
platform version.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9860
2020-01-21 15:04:13 -08:00
Brian Behlendorf
70835c5b75
Unify target_cpu handling
Over the years several slightly different approaches were used
in the Makefiles to determine the target architecture.  This
change updates both the build system and Makefile to handle
this in a consistent fashion.

TARGET_CPU is set to i386, x86_64, powerpc, aarch6 or sparc64
and made available in the Makefiles to be used as appropriate.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9848
2020-01-17 12:40:09 -08:00
Tom Caputi
61152d1069 Fix errata #4 handling for resuming streams
Currently, the handling for errata #4 has two issues which allow
the checks for this issue to be bypassed using resumable sends.
The first issue is that drc->drc_fromsnapobj is not set in the
resuming code as it is in the non-resuming code. This causes
dsl_crypto_recv_key_check() to skip its checks for the
from_ivset_guid. The second issue is that resumable sends do not
clean up their on-disk state if they fail the checks in
dmu_recv_stream() that happen before any data is received.

As a result of these two bugs, a user can attempt a resumable send
of a dataset without a from_ivset_guid. This will fail the initial
dmu_recv_stream() checks, leaving a valid resume state. The send
can then be resumed, which skips those checks, allowing the receive
to be completed.

This commit fixes these issues by setting drc->drc_fromsnapobj in
the resuming receive path and by ensuring that resumablereceives
are properly cleaned up if they fail the initial dmu_recv_stream()
checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9818 
Closes #9829
2020-01-14 12:25:20 -08:00
loli10K
7e2da7786e KMC_KVMEM disrupts kv_alloc() memory alignment expectations
On kernels with KASAN enabled the following failure can be observed as
soon as the zfs module is loaded:

  VERIFY(IS_P2ALIGNED(ptr, PAGE_SIZE)) failed
  PANIC at spl-kmem-cache.c:228:kv_alloc()

The problem is kmalloc() has never guaranteed aligned allocations; this
requirement resulted in zfsonlinux/spl@8b45dda which removed all
kmalloc() usage in kv_alloc().

Until a GFP_ALIGNED flag (or equivalent functionality) is provided by
the kernel this commit partially reverts 66955885 and 6d948c35 to
prevent k(v)malloc() allocations in kv_alloc().

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9813
2020-01-14 09:09:59 -08:00
Brian Behlendorf
e458fcca75
Change http://zfsonlinux.org links to https://zfsonlinux.org
Update the project website links contained in to repository to
reference the secure https://zfsonlinux.org address.

Reviewed-By: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9837
2020-01-13 16:43:59 -08:00
Tom Caputi
ba0ba69e50 Add 'zfs send --saved' flag
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9007
2020-01-10 10:16:58 -08:00
loli10K
c24fa4b19a Fix "zpool add -n" for dedup, special and log devices
For dedup, special and log devices "zpool add -n" does not print
correctly their vdev type:

~# zpool add -n pool dedup /tmp/dedup special /tmp/special log /tmp/log
would update 'pool' to the following configuration:
	pool
	  /tmp/normal
	  /tmp/dedup
	  /tmp/special
	  /tmp/log

This could lead storage administrators to modify their ZFS pools to
unexpected and unintended vdev configurations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9783 
Closes #9390
2020-01-06 15:40:06 -08:00
Brian Behlendorf
bc9cef11fd
Fix QAT allocation failure return value
When qat_compress() fails to allocate the required contiguous memory
it mistakenly returns success.  This prevents the fallback software
compression from taking over and (un)compressing the block.

Resolve the issue by correctly setting the local 'status' variable
on all exit paths.  Furthermore, initialize it to CPA_STATUS_FAIL
to ensure qat_compress() always fails safe to guard against any
similar bugs in the future.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9784
Closes #9788
2020-01-06 11:17:53 -08:00
Brian Behlendorf
cc618d179e
Static symbols exported by ICP
The crypto_cipher_init_prov and crypto_cipher_init are declared static
and should not be exported by the ICP.  This resolves the following
warnings observed when building with the 5.4 kernel.

WARNING: "crypto_cipher_init" [.../icp] is a static EXPORT_SYMBOL
WARNING: "crypto_cipher_init_prov" [.../icp] is a static EXPORT_SYMBOL

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9791
2020-01-02 18:08:45 -08:00
Steve Mokris
d5c97f3de7 Avoid some crashes when importing a pool with corrupt metadata
- Skip invalid DVAs when importing pools in readonly mode
  (in addition to when the config is untrusted).

- Upon encountering a DVA with a null VDEV, fail gracefully
  instead of panicking with a NULL pointer dereference.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Steve Mokris <smokris@softpixel.com>
Closes #9022
2019-12-26 10:57:05 -08:00
Brian Behlendorf
635a01aafd
Cancel initialize and TRIM before vdev_metaslab_fini()
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8602
Closes #9751
2019-12-26 10:50:23 -08:00
Brian Behlendorf
d16a207f2e cppcheck: (warning) Possible null pointer dereference: nvh
Move the 'nvh = (void *)buf' assignment after the 'buf == NULL'
check to resolve the warning.  Interestingly, cppcheck 1.88
correctly determines that the existing code is safe, while
cppcheck 1.86 reports the warning.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:25:57 -08:00
Brian Behlendorf
487bddad67 cppcheck: (error) Address of local auto-variable assigned
Suppress autoVariables warnings in the lua interpreter.  The usage
here while unconventional in intentional and the same as upstream.

[module/lua/ldebug.c:327]: (error) Address of local auto-variable
    assigned to a function parameter.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:25:42 -08:00
Brian Behlendorf
070402f112 cppcheck: (warning) Possible null pointer dereference: dnp
The dnp argument can only be set to NULL when the DNODE_DRY_RUN flag
is set.  In which case, an early return path will be executed and a
NULL pointer dereference at the given location is impossible.  Add
an additional ASSERT to silence the cppcheck warning and document
that dbp must never be NULL at the point in the function.

[module/zfs/dnode.c:1566]: (warning) Possible null pointer deref: dnp

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:25:13 -08:00
Ubuntu
abfdb83607 cppcheck: (error) Shifting signed 64-bit value by 63 bits
As of cppcheck 1.82 surpress the warning regarding shifting too many
bits for __divdi3() implemention.  The algorithm used here is correct.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:24:42 -08:00
Ubuntu
7cf1fe6331 cppcheck: (error) Uninitialized variable
As of cppcheck 1.82 warnings are issued when using the list_for_each_*
functions with an uninitialized variable.  Functionally, this is fine
but to resolve the warning initialize these variables.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:24:29 -08:00
Ubuntu
5215fdd43c cppcheck: (error) Uninitialized variable
Resolve the following uninitialized variable warnings.  In practice
these were unreachable due to the goto.  Replacing the goto with a
return resolves the warning and yields more readable code.

[module/icp/algs/modes/ccm.c:892]: (error) Uninitialized variable: ccm_param
[module/icp/algs/modes/ccm.c:893]: (error) Uninitialized variable: ccm_param
[module/icp/algs/modes/gcm.c:564]: (error) Uninitialized variable: gcm_param
[module/icp/algs/modes/gcm.c:565]: (error) Uninitialized variable: gcm_param
[module/icp/algs/modes/gcm.c:599]: (error) Uninitialized variable: gmac_param
[module/icp/algs/modes/gcm.c:600]: (error) Uninitialized variable: gmac_param

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9732
2019-12-18 17:23:54 -08:00
Romain Dolbeau
118fc3ef07 Minor performance fix for NEON RAID-Z
The NEON code replicates too closely the SSE code, including
a masked 16-bits shift. But NEON, like AltiVec (#9539), has
unsigned 8-bits shift, so use that instead and drop the masking.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9725
2019-12-17 19:34:52 -08:00
Matthew Macy
ba434b18ec Fix zfs_xattr_owner_unlinked on FreeBSD and comment
Explain FreeBSD VFS' unfortunate idiosyncratic locking requirements.
There is no functional change for other platforms.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9720
2019-12-16 09:49:05 -08:00
Tomohiro Kusumi
ddb4e69db5 Don't fail to apply umask for O_TMPFILE files
Apply umask to `mode` which will eventually be applied to inode.
This is needed since VFS doesn't apply umask for O_TMPFILE files.

(Note that zpl_init_acl() applies `ip->i_mode &= ~current_umask();`
only when POSIX ACL is used.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8997 
Closes #8998
2019-12-13 15:02:23 -08:00
Tom Caputi
c317c8c811 Allow empty ds_props_obj to be destroyed
Currently, 'zfs list' and 'zfs get' commands can be slow when
working with snapshots that have a ds_props_obj. This is
because the code that discovers all of the properties for these
snapshots needs to read this object for each snapshot, which
almost always ends up causing an extra random synchronous read
for each snapshot. This performance penalty exists even if the
properties on that snapshot have been unset because the object
is normally only freed when the snapshot is freed, even though
it is only created when it is needed.

This patch allows the user to regain 'zfs list' performance on
these snapshots by destroying the ds_props_obj when it no longer
has any entries left. In practice on a production machine, this
optimization seems to make 'zfs list' about 55% faster.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9704
2019-12-13 11:51:39 -08:00
Matthew Macy
13a9a6f5e8 Make zfs_replay.c work on FreeBSD
FreeBSD's vfs currently doesn't permit file systems
to do their own locking. To avoid having to have
duplicate zfs functions with and without locking add
locking here. With luck these changes can be removed
in the future.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9715
2019-12-13 07:54:10 -08:00
Matthew Ahrens
9bb0d89c5c Fix use-after-free of vd_path in spa_vdev_remove()
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #6 0x7fe889cbc6da in start_thread
    #7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #7 0x55ffbc769fba in ztest_execute ztest.c:6714
    #8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9706
2019-12-11 15:38:21 -08:00
Ryan Moeller
957c7aa23c Relocate common quota functions to shared code
The quota functions are common to all implementations and can be
moved to common code.  As a simplification they were moved to the
Linux platform code in the initial refactoring.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9710
2019-12-11 12:12:08 -08:00
Matthew Macy
4bc721965f Add FreeBSD jail support hooks
Add the 'zfs jail/unjail' subcommands along with the relevant 
documentation from FreeBSD.  This feature is not supported on
Linux and still requires the match kernel ioctls which will
be included when the FreeBSD platform code is integrated.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9686
2019-12-11 11:58:37 -08:00
Matthew Macy
657ce25357 Eliminate Linux specific inode usage from common code
Change many of the znops routines to take a znode rather
than an inode so that zfs_replay code can be largely shared
and in the future the much of the znops code may be shared.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9708
2019-12-11 11:53:57 -08:00
Paul Zuchowski
f0bf435176 zio_decompress_data always ASSERTs successful decompression
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail.  Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9612 
Closes #9630
2019-12-10 15:51:58 -08:00
Matthew Macy
362ae8d11f Abstract away platform specific superblock references
The zfsvfs->z_sb field is Linux specified and should be abstracted.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9697
2019-12-10 09:21:07 -08:00
Matthew Macy
3c502d3b75 Exclude data from cores unconditionally and metadata conditionally
This change allows us to align the code dump logic across platforms.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9691
2019-12-09 12:29:56 -08:00
Matthew Macy
ea79e90f99 Mark dsl_dataset_deactivate_feature_impl static
The dsl_dataset_deactivate_feature_impl() function is private and
should be marked as such.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9696
2019-12-09 12:26:33 -08:00
Brian Behlendorf
a25861dcae
ZTS: Fix zpool_reopen_001_pos
Update the vdev_disk_open() retry logic to use a specified number
of milliseconds to be more robust.  Additionally, on failure log
both the time waited and requested timeout to the internal log.

The default maximum allowed open retry time has been increased
from 500ms to 1000ms.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9680
2019-12-09 11:09:14 -08:00
Matthew Macy
0dcef9b966 Disable sysfs feature checks on FreeBSD
The sysfs infrastructure for reporting supported features and
properties is Linux specific.  Disable it on FreeBSD until it can
be extended to be more portable.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9684
2019-12-06 09:44:29 -08:00
Attila Fülöp
3ac34ca375 ICP: Fix out of bounds write
If gcm_mode_encrypt_contiguous_blocks() is called more than once
in succession, with the accumulated lengths being less than
blocksize, ctx->copy_to will be incorrectly advanced. Later, if
out is NULL, the bcopy at line 114 will overflow
ctx->gcm_copy_to since ctx->gcm_remainder_len is larger than the
ctx->gcm_copy_to buffer can hold.

The fix is to set ctx->copy_to only if it's not already set.

For ZoL the issue may be academic, since in all my testing I wasn't
able to hit neither of both conditions needed to trigger it, but
other consumers can easily do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9660
2019-12-06 09:36:19 -08:00
Matthew Macy
f95704ca5e Disable EDONR on FreeBSD
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9664
2019-12-05 13:10:29 -08:00
Matthew Macy
e64e84eca5 Refactor deadman set failmode to be cross platform
Update zfs_deadman_failmode to use the ZFS_MODULE_PARAM_CALL
wrapper, and split the common and platform specific portions.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9670
2019-12-05 12:40:45 -08:00
Matthew Macy
2a8ba608d3 Replace ASSERTV macro with compiler annotation
Remove the ASSERTV macro and handle suppressing unused 
compiler warnings for variables only in ASSERTs using the 
__attribute__((unused)) compiler annotation.  The annotation
is understood by both gcc and clang.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9671
2019-12-05 12:37:00 -08:00
Attila Fülöp
54c8366e39 ICP: Fix null pointer dereference and use after free
In gcm_mode_decrypt_contiguous_blocks(), if vmem_alloc() fails,
bcopy is called with a NULL pointer destination and a length > 0.
This results in undefined behavior. Further ctx->gcm_pt_buf is
freed but not set to NULL, leading to a potential write after
free and a double free due to missing return value handling in
crypto_update_uio(). The code as is may write to ctx->gcm_pt_buf
in gcm_decrypt_final() and may free ctx->gcm_pt_buf again in
aes_decrypt_atomic().

The fix is to slightly rework error handling and check the return
value in crypto_update_uio().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9659
2019-12-03 10:28:47 -08:00
Alexander Motin
5ff2249fa5 Fix use-after-free in case of L2ARC prefetch failure
In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device.  Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read try to bump the ZIO priority, it will crash.

The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9648
2019-12-03 09:59:30 -08:00
Brian Behlendorf
624222ae31
Increase allowed 'special_small_blocks' maximum value
There may be circumstances where it's desirable that all blocks
in a specified dataset be stored on the special device.  Relax
the artificial 128K limit and allow the special_small_blocks
property to be set up to 1M.  When blocks >1MB have been enabled
via the zfs_max_recordsize module option, this limit is increased
accordingly.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9131
Closes #9355
2019-12-03 09:58:03 -08:00
Matthew Macy
b3673342c7 Wrap module_param_call() routines under __linux__
The module_param_call() functionality is currently still
Linux-specific and should be wrapped accordingly.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9666
2019-12-03 09:56:15 -08:00
Matthew Macy
bff8fb395b Mark write_record static
The write_record() function is private and should be marked as such.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9665
2019-12-03 09:51:44 -08:00
Matthew Macy
74d1d74959 Move linux qsort def to platform header
Moving qsort to the platform header allows each platform to
provide an appropriate sorting implementation.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9663
2019-12-03 09:49:40 -08:00
Michael Niewöhner
e69bb31b71 Adapt gitignore for modules
Remove the specific gitignore rules for module left-overs and add a
generic one in modules/.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9656
2019-12-02 13:23:47 -08:00
Matthew Macy
5142032106 Move zfs_cmd_t copyin/copyout to platform code
FreeBSD needs to cope with multiple version of the zfs_cmd_t
structure. Allowing the platform code to pre and post
process the cmd structure makes it possible to work with
legacy tooling.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9624
2019-12-02 10:08:27 -08:00
Matthew Macy
758699b6f1 Restructure nvlist_nv_alloc to work on FreeBSD
KM_PUSHPAGE is an Illumosism - On FreeBSD it's
aliased to the same malloc flag as KM_SLEEP.
The compiler naturally rejects multiple case
statements with the same value.  This is effectively
a no-op since all callers pass a specific KM_* flag.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9643
2019-11-30 15:45:06 -08:00
Matthew Macy
f348c78f97 Mark Linux fallocate extensions as specific to Linux
fallocate(2) is a Linux-specific system call which in unavailable
on other platforms.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9633
2019-11-30 15:40:22 -08:00
Matthew Macy
a5b762ab1d Resolve ZoF differences in zfs_ioctl.h
FreeBSD needs to be able to pass the jail id to the jail/unjail ioctls
and the struct file in the device structure is unused.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9625
2019-11-30 15:35:54 -08:00
Brian Behlendorf
9e17e6f254
Remove zfs_vdev_elevator module option
As described in commit f81d5ef6 the zfs_vdev_elevator module
option is being removed.  Users who require this functionality
should update their systems to set the disk scheduler using a
udev rule.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9417
Closes #9609
2019-11-27 10:35:49 -08:00
jwpoduska
3c819a2c7d Prevent unnecessary resilver restarts
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840 
Closes #9155
Closes #9378
Closes #9551
Closes #9588
2019-11-27 10:15:01 -08:00
Mauricio Faria de Oliveira
0c46813805 Check for unlinked znodes after igrab()
The changes in commit 41e1aa2a / PR #9583 introduced a regression on
tmpfile_001_pos: fsetxattr() on a O_TMPFILE file descriptor started
to fail with errno ENODATA:

    openat(AT_FDCWD, "/test", O_RDWR|O_TMPFILE, 0666) = 3
    <...>
    fsetxattr(3, "user.test", <...>, 64, 0) = -1 ENODATA

The originally proposed change on PR #9583 is not susceptible to it,
so just move the code/if-checks around back in that way, to fix it.

Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Heitor Alves de Siqueira <halves@canonical.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes #9602
2019-11-21 12:24:03 -08:00
Matthew Macy
da92d5cbb3 Add zfs_file_* interface, remove vnodes
Provide a common zfs_file_* interface which can be implemented on all 
platforms to perform normal file access from either the kernel module
or the libzpool library.

This allows all non-portable vnode_t usage in the common code to be 
replaced by the new portable zfs_file_t.  The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL.  Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9556
2019-11-21 09:32:57 -08:00
Brian Behlendorf
7ae3f8dc8f
Partially revert 5a6ac4c
Reinstate the zpl_revalidate() functionality to resolve a regression
where dentries for open files during a rollback are not invalidated.

The unrelated functionality for automatically unmounting .zfs/snapshots
was not reverted.  Nor was the addition of shrink_dcache_sb() to the
zfs_resume_fs() function.

This issue was not immediately caught by the CI because the test case
intended to catch it was included in the list of ZTS tests which may
occasionally fail for unrelated reasons.  Remove all of the rollback
tests from this list to help identify the frequency of any spurious
failures.

The rollback_003_pos.ksh test case exposes a real issue with the
long standing code which needs to be investigated.  Regardless,
it has been enable with a small workaround in the test case itself.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9587
Closes #9592
2019-11-18 13:05:56 -08:00
Heitor Alves de Siqueira
41e1aa2a06 Break out of zfs_zget early if unlinked znode
If zp->z_unlinked is set, we're working with a znode that has been
marked for deletion. If that's the case, we can skip the "goto again"
loop and return ENOENT, as the znode should not be discovered.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com>
Closes #9583
2019-11-15 09:56:05 -08:00
loli10K
7ba964cc3f Prevent NULL pointer dereference in blkg_tryget() on EL8 kernels
blkg_tryget() as shipped in EL8 kernels does not seem to handle NULL
@blkg as input; this is different from its mainline counterpart where
NULL is accepted.  To prevent dereferencing a NULL pointer when dealing
with block devices which do not set a root_blkg on the request queue
perform the NULL check in vdev_bio_associate_blkg().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9546 
Closes #9577
2019-11-13 10:19:06 -08:00
Michael Niewöhner
8aaa10a9a0 Check for __GFP_RECLAIM instead of GFP_KERNEL
Check for __GFP_RECLAIM instead of GFP_KERNEL because zfs modifies
IO and FS flags which breaks the check for GFP_KERNEL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9034
2019-11-13 10:05:23 -08:00
Michael Niewöhner
6d948c3519 Add kmem_cache flag for forcing kvmalloc
This adds a new KMC_KVMEM flag was added to enforce use of the
kvmalloc allocator in kmem_cache_create even for large blocks, which
may also increase performance in some specific cases (e.g. zstd), too.

Default to KVMEM instead of VMEM in spl_kmem_cache_create.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9034
2019-11-13 10:05:23 -08:00
Michael Niewöhner
66955885e2 Make use of kvmalloc if available and fix vmem_alloc implementation
This patch implements use of kvmalloc for GFP_KERNEL allocations, which
may increase performance if the allocator is able to allocate physical
memory, if kvmalloc is available as a public kernel interface (since
v4.12). Otherwise it will simply fall back to virtual memory (vmalloc).

Also fix vmem_alloc implementation which can lead to slow allocations
since the first attempt with kmalloc does not make use of the noretry
flag but tells the linux kernel to retry several times before it fails.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9034
2019-11-13 10:05:10 -08:00
Michael Niewöhner
c025008df5 Add missing documentation for some KMC flags
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9034
2019-11-13 09:34:51 -08:00
Brian Behlendorf
066e825221
Linux compat: Minimum kernel version 3.10
Increase the minimum supported kernel version from 2.6.32 to 3.10.
This removes support for the following Linux enterprise distributions.

    Distribution     | Kernel | End of Life
    ---------------- | ------ | -------------
    Ubuntu 12.04 LTS | 3.2    | Apr 28, 2017
    SLES 11          | 3.0    | Mar 32, 2019
    RHEL / CentOS 6  | 2.6.32 | Nov 30, 2020

The following changes were made as part of removing support.

* Updated `configure` to enforce a minimum kernel version as
  specified in the META file (Linux-Minimum: 3.10).

    configure: error:
        *** Cannot build against kernel version 2.6.32.
        *** The minimum supported kernel version is 3.10.

* Removed all `configure` kABI checks and matching C code for
  interfaces which solely predate the Linux 3.10 kernel.

* Updated all `configure` kABI checks to fail when an interface is
  missing which was in the 3.10 kernel up to the latest 5.1 kernel.
  Removed the HAVE_* preprocessor defines for these checks and
  updated the code to unconditionally use the verified interface.

* Inverted the detection logic in several kABI checks to match
  the new interface as it appears in 3.10 and newer and not the
  legacy interface.

* Consolidated the following checks in to individual files. Due
  the large number of changes in the checks it made sense to handle
  this now.  It would be desirable to group other related checks in
  the same fashion, but this as left as future work.

  - config/kernel-blkdev.m4 - Block device kABI checks
  - config/kernel-blk-queue.m4 - Block queue kABI checks
  - config/kernel-bio.m4 - Bio interface kABI checks

* Removed the kABI checks for sops->nr_cached_objects() and
  sops->free_cached_objects().  These interfaces are currently unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9566
2019-11-12 08:59:06 -08:00
Pavel Snajdr
5a6ac4cffc Remove zpl_revalidate
This patch removes the need for zpl_revalidate altogether.

There were 3 main reasons why we used d_revalidate:

1. periodic automounted snapshots umount deferral
2. negative dentries created before snapshot rollback
3. stale inodes referenced by dentry cache after snapshot rollback

Periodic snapshots deferral solution introduces zfs_exit_fs function,
which is called as a part of ZFS_EXIT(zfsvfs_t) macro.

Negative dentries and stale inodes are solved by flushing the dcache
for the particular dataset on zfs_resume_fs call.

This patch also removes now unused HAVE_S_D_OP configure test.

Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #8774 
Closes #9549
2019-11-11 09:34:21 -08:00
Alexander Motin
f15d6a5457 Improve logging of 128KB writes
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks.  After that change it
became ~127KB+/1KB+ in two 128KB log blocks to free space in the second
block for another record.  Unfortunately in case of 128KB only writes,
when space in the second block remained unused, that change increased
write latency by unbalancing checksum computation and write times
between parallel threads.  It also didn't help with SLOG space
efficiency in that case.

This change introduces new 68KB log block size, used for both writes
below 67KB and 128KB-sharp writes.  Writes of 68-127KB are still using
one 128KB block to not increase processing overhead.  Writes above
131KB are still using full 128KB blocks, since possible saving there
is small.  Mixed loads will likely also fall back to previous 128KB,
since code uses maximum of the last 16 requested block sizes.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #9409
2019-11-11 09:27:59 -08:00
Romain Dolbeau
4254e40729 Preliminary support for RV64G
This adds basic support for RISC-V, specifically RV64G.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9540
2019-11-06 10:56:09 -08:00
Matthew Macy
27ece2ee4d Move platform specific parts of zfs_znode.h to platform code
Some of the znode fields are different and functions
consuming an inode don't exist on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9536
2019-11-06 10:54:25 -08:00
Prakash Surya
ae38e00968 Add tracepoints for taskq entry lifetime events
This adds some new DTRACE_PROBE* endpoints so that we can observe taskq
latencies on a system. Additionally, a new "taskqlatency.bt" script is
added to do this observation via "bpftrace". Lastly, a "zfs-trace.sh"
script is added to wrap "bpftrace" with the proper options required to
run and use "taskqlatency.bt".

For example, with these changes in place, a user can run the following:

    $ cd ./contrib/bpftrace
    $ sudo ./zfs-trace.sh taskqlatency.bt
    Attaching 6 probes...
    ^C

Here's some example output, showing latency information for time spent
executing the taskq entry's function:

    @exec_lat_us[dp_sync_taskq, userquota_updates_task]:
    [2, 4)                 5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [4, 8)                 0 |                                                    |
    [8, 16)                1 |@@@@@@@@@@                                          |
    [16, 32)               2 |@@@@@@@@@@@@@@@@@@@@                                |

    @exec_lat_us[z_wr_int_h, zio_execute]:
    [8, 16)               16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16, 32)               2 |@@@@@@                                              |

    @exec_lat_us[z_wr_iss_h, zio_execute]:
    [16, 32)               4 |@@@@@@@@@@@@@@@@                                    |
    [32, 64)              13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128)              1 |@@@@                                                |

    @exec_lat_us[z_ioctl_int, zio_execute]:
    [2, 4)                 1 |@@@@                                                |
    [4, 8)                11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [8, 16)                8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |

    @exec_lat_us[dp_sync_taskq, sync_dnodes_task]:
    [2, 4)                 1 |@@@@@@                                              |
    [4, 8)                 7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
    [8, 16)                8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16, 32)               2 |@@@@@@@@@@@@@                                       |
    [32, 64)               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
    [64, 128)              1 |@@@@@@                                              |
    [128, 256)             0 |                                                    |
    [256, 512)             1 |@@@@@@

Here's some example output, showing latency information for time spent
waiting on the taskq, prior to starting execution of entry's function:

    @queue_lat_us[dp_sync_taskq]:
    [2, 4)                 1 |@@@@                                                |
    [4, 8)                 7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
    [8, 16)                2 |@@@@@@@@                                            |
    [16, 32)               3 |@@@@@@@@@@@@@                                       |
    [32, 64)              12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128)              6 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
    [128, 256)             0 |                                                    |
    [256, 512)             1 |@@@@                                                |

    @queue_lat_us[z_wr_iss]:
    [4, 8)                 4 |@@@@                                                |
    [8, 16)               13 |@@@@@@@@@@@@@@@                                     |
    [16, 32)               6 |@@@@@@@                                             |
    [32, 64)               2 |@@                                                  |
    [64, 128)             12 |@@@@@@@@@@@@@@                                      |
    [128, 256)            15 |@@@@@@@@@@@@@@@@@@                                  |
    [256, 512)            33 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
    [512, 1K)             27 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
    [1K, 2K)               7 |@@@@@@@@                                            |
    [2K, 4K)              14 |@@@@@@@@@@@@@@@@                                    |
    [4K, 8K)              14 |@@@@@@@@@@@@@@@@                                    |
    [8K, 16K)             23 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
    [16K, 32K)            43 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

    @queue_lat_us[z_wr_int]:
    [2, 4)                10 |@@@@@                                               |
    [4, 8)                71 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           |
    [8, 16)               88 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [16, 32)              50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
    [32, 64)              65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
    [64, 128)             43 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
    [128, 256)            19 |@@@@@@@@@@@                                         |
    [256, 512)             3 |@                                                   |
    [512, 1K)              1 |                                                    |

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9525
2019-11-01 13:14:54 -07:00
Prakash Surya
e5d1c27e30 Enable use of DTRACE_PROBE* macros in "spl" module
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.

Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.

Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.

Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9525
2019-11-01 13:13:43 -07:00
Matthew Macy
4a2ed90013 Wrap Linux module macros
MODULE_VERSION is already defined on FreeBSD. Wrap all of the
used MODULE_* macros for the sake of consistency and portability.

Add a user space noop version to reduce the need for _KERNEL ifdefs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9542
2019-11-01 10:41:03 -07:00
Matthew Macy
bd4dde8ef7 Prefix struct rangelock
A struct rangelock already exists on FreeBSD.  Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9534
2019-11-01 10:37:33 -07:00
Matthew Macy
32682b0c03 Fix icp build on FreeBSD
- ROTATE_LEFT is not used by amd64, move it down within
  the scope it's used to silence a clang warning.

- __unused is an alias for the compiler annotation
  __attribute__((__unused__)) on FreeBSD.  Rename the
  field to ____unused.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9538
2019-11-01 10:27:53 -07:00
Matthew Macy
156f74fc03 Return an error code from zfs_acl_chmod_setattr
The FreeBSD implementation can fail, allow this function to
fail and add the required error handling for Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9541
2019-11-01 10:19:11 -07:00
Matthew Macy
59055a0164 Include prototypes for vdev_initialize
Address two prototype related warnings emitted by clang.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9535
2019-10-31 10:09:01 -07:00
Matthew Macy
2a3aa5a109 Factor Linux specific code out of spa_misc.c
Move these Linux module parameter get/set helpers in to
platform specific code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9457
2019-10-31 09:52:22 -07:00
Romain Dolbeau
0b2a642351 Add AVX512BW variant of fletcher
It is much faster than AVX512F when byteswapping on Skylake-SP
and newer, as we can do the byteswap in a single vshufb instead
of many instructions.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #9517
2019-10-30 12:26:14 -07:00
Tom Caputi
bae11ba8dc Fix 'zfs change-key' with unencrypted child
Currently, when you call 'zfs change-key' on an encrypted dataset
that has an unencrypted child, the code will trigger a VERIFY.
This VERIFY is leftover from before we allowed unencrypted
datasets to exist underneath encrypted ones. This patch fixes the
issue by simply replacing the VERIFY with an early return when
recursing through datasets.

Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9524
2019-10-30 11:27:28 -07:00
Matthew Macy
4a22ba5be0 Minor spa portability fixes
- FreeBSD's rootpool import code uses spa_config_parse
- Move the zvol_create_minors call out from under the
  spa_namespace_lock in spa_import. It isn't needed and it causes
  a lock order reversal on FreeBSD.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9499
2019-10-28 09:51:53 -07:00
loli10K
e35704647e Fix for ARC sysctls ignored at runtime
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.

A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9487  
Closes #9489
2019-10-26 15:22:19 -07:00
Matthew Macy
0ee89a1252 Remove non-portable pointer is valid assert
This assert makes non portable assumptions about the state of memory
returned by the memory allocator.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9506
2019-10-25 13:46:07 -07:00
Matthew Macy
c392c5aec0 Move final zvol_remove_minors to common code
This logic is not platform dependent and should reside in the
common code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9505
2019-10-25 13:42:54 -07:00
Matthew Macy
68a1b1589a Remove sdt.h
It's mostly a noop on ZoL and it conflicts with platforms that 
support dtrace.  Remove this header to resolve the conflict.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9497
2019-10-25 13:38:37 -07:00
Brian Behlendorf
10fa254539
Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346
Closes #9403
2019-10-24 10:17:33 -07:00
chrisrd
05d07ba9a7 Don't call arc_buf_destroy on unallocated arc_buf
Fixes an obvious issue of calling arc_buf_destroy() on an
unallocated arc_buf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #9453
2019-10-23 15:26:17 -07:00
Serapheim Dimitropoulos
851eda3566 Update skc_obj_alloc for spl kmem caches that are backed by Linux
Currently, for certain sizes and classes of allocations we use
SPL caches that are backed by caches in the Linux Slab allocator
to reduce fragmentation and increase utilization of memory. The
way things are implemented for these caches as of now though is
that we don't keep any statistics of the allocations that we
make from these caches.

This patch enables the tracking of allocated objects in those
SPL caches by making the trade-off of grabbing the cache lock
at every object allocation and free to update the respective
counter.

Additionally, this patch makes those caches visible in the
/proc/spl/kmem/slab special file.

As a side note, enabling the specific counter for those caches
enables SDB to create a more user-friendly interface than
/proc/spl/kmem/slab that can also cross-reference data from
slabinfo. Here is for example the output of one of those
caches in SDB that outputs the name of the underlying Linux
cache, the memory of SPL objects allocated in that cache,
and the percentage of those objects compared to all the
objects in it:
```
> spl_kmem_caches | filter obj.skc_name == "zio_buf_512" | pp
name        ...            source total_memory util
----------- ... ----------------- ------------ ----
zio_buf_512 ... kmalloc-512[SLUB]       16.9MB    8
```

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9474
2019-10-18 13:24:28 -04:00
Matthew Macy
c9c9c1e213 OpenZFS restructuring - ARC memory pressure
Factor Linux specific memory pressure handling out of ARC.  Each
platform will have different available interfaces for managing memory
pressure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9472
2019-10-18 13:23:19 -04:00
Matthew Macy
08f530c699 Make zfsdev_getminor signature cross platform
Only pass the file descriptor to make zfsdev_get_miror() portable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9466
2019-10-16 18:43:52 -07:00
Matthew Macy
0e939e434a Move linux specific mmp module_param_call handler to platform code
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9465
2019-10-16 18:37:31 -07:00
Matthew Macy
cf2eba666e Add default case to lua kernel code
Some platforms, e.g. FreeBSD, support user space setjmp
semantics in kernel.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9450
2019-10-16 18:36:15 -07:00
Matthew Macy
47d57dbccf Fix signature for private functions without header declarations
Clang will complain if a function has no prior declaration

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9467
2019-10-16 18:34:19 -07:00
Matthew Macy
177c79dfbe Remove dead code and cleanup scoping in dmu_send.c
This addresses a number of problems with dmu_send.c:

* bp_span is unused which makes clang complain
* dump_write conflicts with FreeBSD's existing core dump code
* range_alloc is private to the file and not declared in any headers
  causing clang to complain

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9432
2019-10-13 19:25:19 -07:00
Paul Dagnelie
511fce6b1f Don't call sizeof on void
We get the sizeof the appropriate type, and don't cast away const.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9455
2019-10-13 19:21:51 -07:00
Matthew Macy
cdbba101f4 Move zfs_onexit_fd_hold to platform code
FreeBSD has a very different implementation.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9442
2019-10-13 19:19:39 -07:00
Matthew Macy
c324701332 Move zio_delay_interrupt to platform code
FreeBSD has its own implementation as do other platforms.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9439
2019-10-13 19:15:27 -07:00
chrisrd
2c6fa6eafb Typo fix in comment: dso_dryrun
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #9452
2019-10-11 10:33:13 -07:00
Paul Dagnelie
516a83f886 Function name and comment updates
Rename certain functions for more consistency when they share common 
features. Make comments clearer about what arguments should be passed 
to the insert and add functions.

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9441
2019-10-11 10:13:21 -07:00
Matthew Macy
bce795ad7a Remove linux/mod_compat.h from common code
It is no longer necessary; mod_compat.h is included from zfs_context.h.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9449
2019-10-11 10:10:20 -07:00
Matthew Macy
af1698f59b Expose dmu_buf_hold_array_by_dnode to platform code
FreeBSD uses this in its pager ops routines

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9431
2019-10-11 10:06:18 -07:00
loli10K
715c996d3b Fix pool creation with feature@allocation_classes disabled
When "feature@allocation_classes" is not enabled on the pool no vdev
with "special" or "dedup" allocation type should be allowed to exist in
the vdev tree.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9427 
Closes #9429
2019-10-10 16:39:41 -07:00
Matthew Macy
2516a87821 Move get_temporary_prop to platform code
Temporary property handling at the VFS layer requires
platform specific code.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9401
2019-10-10 15:59:34 -07:00
Matthew Macy
6501906280 Add kmem cache accessors
Make the metaslab platform agnostic again by adding
accessor functions which can be implemented by each
platform.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9404
2019-10-10 15:45:52 -07:00
Matthew Macy
eedb3a62b9 Make zil_async_to_sync visible to platform code
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9440
2019-10-10 15:39:44 -07:00
Matthew Macy
e4f5fa1229 Fix strdup conflict on other platforms
In the FreeBSD kernel the strdup signature is:

```
char	*strdup(const char *__restrict, struct malloc_type *);
```

It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9433
2019-10-10 09:47:06 -07:00
Matthew Macy
c5858ff946 Make clang happy with vdev_raidz_ code
The macros are used to generate code for conditions without a
corresponding branch. This is not a problem in practice, but
clang has no way of knowing that. Add a default branch with a
VERIFY(0) to indicate that it "can't happen"

```
In file included from \
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607:
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \
error: no case matching constant switch condition '3' [-Werror]
```

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9434
2019-10-10 09:45:37 -07:00
Paul Dagnelie
ca5777793e Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to 
store range trees more efficiently.

The new structure is approximately a B-tree, though there are some 
small differences from the usual characterizations. The tree has core 
nodes and leaf nodes; each contain data elements, which the elements 
in the core nodes acting as separators between its children. The 
difference between core and leaf nodes is that the core nodes have an 
array of children, while leaf nodes don't. Every node in the tree may 
be only partially full; in most cases, they are all at least 50% full 
(in terms of element count) except for the root node, which can be 
less full. Underfull nodes will steal from their neighbors or merge to 
remain full enough, while overfull nodes will split in two. The data 
elements are contained in tree-controlled buffers; they are copied 
into these on insertion, and overwritten on deletion. This means that 
the elements are not independently allocated, which reduces overhead, 
but also means they can't be shared between trees (and also that 
pointers to them are only valid until a side-effectful tree operation 
occurs). The overhead varies based on how dense the tree is, but is 
usually on the order of about 50% of the element size; the per-node 
overheads are very small, and so don't make a significant difference. 
The trees can accept arbitrary records; they accept a size and a 
comparator to allow them to be used for a variety of purposes.

The new trees replace the AVL trees used in the range trees today. 
Currently, the range_seg_t structure contains three 8 byte integers 
of payload and two 24 byte avl_tree_node_ts to handle its storage in 
both an offset-sorted tree and a size-sorted tree (total size: 64 
bytes). In the new model, the range seg structures are usually two 4 
byte integers, but a separate one needs to exist for the size-sorted 
and offset-sorted tree. Between the raw size, the 50% overhead, and 
the double storage, the new btrees are expected to use 8*1.5*2 = 24 
bytes per record, or 33.3% as much memory as the AVL trees (this is 
for the purposes of storing metaslab range trees; for other purposes, 
like scrubs, they use ~50% as much memory).

We reduced the size of the payload in the range segments by teaching 
range trees about starting offsets and shifts; since metaslabs have a 
fixed starting offset, and they all operate in terms of disk sectors, 
we can store the ranges using 4-byte integers as long as the size of 
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle 
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not 
anticipate disks of this size in the near future, there should be 
almost no cases where metaslabs need 64-byte integers to store their 
ranges. We do still have the capability to store 64-byte integer ranges 
to account for cases where we are storing per-vdev (or per-dnode) trees, 
which could reasonably go above the limits discussed. We also do not 
store fill information in the compact version of the node, since it 
is only used for sorted scrub.

We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than 
they are for the avl tree, remove usually requires a find operation, 
while in the AVL tree model the element itself suffices. Some clever 
changes actually caused an overall speedup in metaslab loading; we use 
approximately 40% less cpu to load metaslabs in our tests on Illumos.

Another memory and performance optimization was achieved by changing 
what is stored in the size-sorted trees. When a disk is heavily 
fragmented, the df algorithm used by default in ZFS will almost always 
find a number of small regions in its initial cursor-based search; it 
will usually only fall back to the size-sorted tree to find larger 
regions. If we increase the size of the cursor-based search slightly, 
and don't store segments that are smaller than a tunable size floor 
in the size-sorted tree, we can further cut memory usage down to 
below 20% of what the AVL trees store. This also results in further 
reductions in CPU time spent loading metaslabs.

The 16KiB size floor was chosen because it results in substantial memory 
usage reduction while not usually resulting in situations where we can't 
find an appropriate chunk with the cursor and are forced to use an 
oversized chunk from the size-sorted tree. In addition, even if we do 
have to use an oversized chunk from the size-sorted tree, the chunk 
would be too small to use for ZIL allocations, so it isn't as big of a 
loss as it might otherwise be. And often, more small allocations will 
follow the initial one, and the cursor search will now find the 
remainder of the chunk we didn't use all of and use it for subsequent 
allocations. Practical testing has shown little or no change in 
fragmentation as a result of this change.

If the size-sorted tree becomes empty while the offset sorted one still 
has entries, it will load all the entries from the offset sorted tree 
and disregard the size floor until it is unloaded again. This operation 
occurs rarely with the default setting, only on incredibly thoroughly 
fragmented pools.

There are some other small changes to zdb to teach it to handle btrees, 
but nothing major.
                                           
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 10:36:03 -07:00
George Melikov
7b50929851 module/Makefile.in: don't run xargs if empty
If stdin if empty - don't run xargs command,
otherwise we can get `cp: missing file operand`
error.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9418
2019-10-08 10:10:23 -07:00
Brian Behlendorf
6bd4f4545d
Fix automount for root filesystems
Commit 093bb64 resolved an automount failures for chroot'd processes
but inadvertently broke automounting for root filesystems where the
vfs_mntpoint is NULL.  Resolve the issue by checking for NULL in order
to generate the correct path.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9381
Closes #9384
2019-10-04 12:30:51 -07:00
Matthew Macy
2cc479d049 Rename rangelock_ functions to zfs_rangelock_
A rangelock KPI already exists on FreeBSD.  Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9402
2019-10-03 15:54:29 -07:00
Tony Nguyen
64b6c47d90 dbuf_hold_impl() cleanup to improve cached read performance
Currently every dbuf_hold_impl() incurs kmem_alloc() and kmem_free()
which can be costly for cached read performance.

This change reverts the dbuf_hold_impl() fix stack commit, i.e.
fc5bb51f08 to eliminate the extra
kmem_alloc() and kmem_free() operations and improve cached read
performance. With the change, each dbuf_hold_impl() frame uses 40 bytes
more, total of 800 for 20 recursive levels. Linux kernel stack sizes are
8K and 16K for 32bit and 64bit, respectively, so stack overrun risk is
limited.

Sample stack output comparisons with 50 PB file and recordsize=512
Current code
 11)     2240      64   arc_alloc_buf+0x4a/0xd0 [zfs]
 12)     2176     264   dbuf_read_impl.constprop.16+0x2e3/0x7f0 [zfs]
 13)     1912     120   dbuf_read+0xe5/0x520 [zfs]
 14)     1792      56   dbuf_hold_impl_arg+0x572/0x630 [zfs]
 15)     1736      64   dbuf_hold_impl_arg+0x508/0x630 [zfs]
 16)     1672      64   dbuf_hold_impl_arg+0x508/0x630 [zfs]
 17)     1608      40   dbuf_hold_impl+0x23/0x40 [zfs]
 18)     1568      40   dbuf_hold_level+0x32/0x60 [zfs]
 19)     1528      16   dbuf_hold+0x16/0x20 [zfs]

dbuf_hold_impl() cleanup
 11)     2320      64   arc_alloc_buf+0x4a/0xd0 [zfs]
 12)     2256     264   dbuf_read_impl.constprop.17+0x2e3/0x7f0 [zfs]
 13)     1992     120   dbuf_read+0xe5/0x520 [zfs]
 14)     1872      96   dbuf_hold_impl+0x50f/0x5e0 [zfs]
 15)     1776     104   dbuf_hold_impl+0x4df/0x5e0 [zfs]
 16)     1672     104   dbuf_hold_impl+0x4df/0x5e0 [zfs]
 17)     1568      40   dbuf_hold_level+0x32/0x60 [zfs]
 18)     1528      16   dbuf_hold+0x16/0x20 [zfs]

Performance observations on 8K recordsize filesystem:
- 8/128/1024K at 1-128 sequential cached read, ~3% improvement

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #9351
2019-10-03 15:33:38 -07:00
Matthew Macy
6360e2779e Add inode accessors to common code
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9389
2019-10-02 09:15:12 -07:00
Matthew Macy
13a4027a7c OpenZFS restructuring - arc_stats
Make arc_stats visible to platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9386
2019-10-01 16:35:05 -07:00
Matthew Macy
7111c86ca3 Enable clang to use intrinsics for lz4
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9385
2019-10-01 13:17:32 -07:00
Prakash Surya
99573cc053 Timeout waiting for ZVOL device to be created
We've seen cases where after creating a ZVOL, the ZVOL device node in
"/dev" isn't generated after 20 seconds of waiting, which is the point
at which our applications gives up on waiting and reports an error.

The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the
same time, based on a policy set by the user. This refresh operation
will destroy the ZVOL, and re-create it based on a snapshot.

When this occurs, we see many hundreds of entries on the "z_zvol" taskq
(based on inspection of the /proc/spl/taskq-all file). Many of the
entries on the taskq end up in the "zvol_remove_minors_impl" function,
and I've measured the latency of that function:

Function = zvol_remove_minors_impl
msecs               : count     distribution
  0 -> 1          : 0        |                                        |
  2 -> 3          : 0        |                                        |
  4 -> 7          : 1        |                                        |
  8 -> 15         : 0        |                                        |
 16 -> 31         : 0        |                                        |
 32 -> 63         : 0        |                                        |
 64 -> 127        : 1        |                                        |
128 -> 255        : 45       |****************************************|
256 -> 511        : 5        |****                                    |

That data is from a 10 second sample, using the BCC "funclatency" tool.
As we can see, in this 10 second sample, most calls took 128ms at a
minimum. Thus, some basic math tells us that in any 20 second interval,
we could only process at most about 150 removals, which is much less
than the 400+ that'll occur based on the workload.

As a result of this, and since all ZVOL minor operations will go through
the single threaded "z_zvol" taskq, the latency for creating a single
ZVOL device can be unreasonably large due to other ZVOL activity on the
system. In our case, it's large enough to cause the application to
generate an error and fail the operation.

When profiling the "zvol_remove_minors_impl" function, I saw that most
of the time in the function was spent off-cpu, blocked in the function
"taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl"
will dispatch calls to "zvol_free" using the "system_taskq", and then
the "taskq_wait_outstanding" function is used to wait for all of those
dispatched calls to occur before "zvol_remove_minors_impl" will return.

As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have
to wait for all calls to "zvol_free" to occur before it returns. Thus,
this change removes the call to "taskq_wait_oustanding", so that calls
to "zvol_free" don't affect the latency of "zvol_remove_minors_impl".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9380
2019-10-01 12:33:12 -07:00
Matthew Macy
7bb0c29468 OpenZFS restructuring - zfs_ioctl
Refactor the zfs ioctls in to platform dependent and independent bits.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9301
2019-09-27 10:46:28 -07:00
Brian Behlendorf
f81d5ef686
Add warning for zfs_vdev_elevator option removal
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices.  At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel which prompted
us to add compatibly code to set the elevator via a usermodehelper.
While well intentioned this introduced a bug which could cause a
system hang, that issue was subsequently fixed by commit 2a0d4188.

In order to avoid future bugs in this area, and to simplify the code,
this functionality is being deprecated.  A console warning has been
added to notify any existing consumers and the documentation updated
accordingly.  This option will remain for the lifetime of the 0.8.x
series for compatibility but if planned to be phased out of master.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8664
Closes #9317
2019-09-25 09:23:29 -07:00
Matthew Macy
5df7e9d85c OpenZFS restructuring - zvol
Refactor the zvol in to platform dependent and independent bits.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9295
2019-09-25 09:20:30 -07:00
loli10K
d359e99c38 diff_cb() does not handle large dnodes
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7678 
Closes #8931 
Closes #9343
2019-09-24 12:01:37 -07:00
Kody A Kantor
d49d7336dd Disabled resilver_defer feature leads to looping resilvers
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes #9299 
Closes #9338
2019-09-22 15:25:39 -07:00
Andriy Gapon
dd262c9681 Fix dsl_scan_ds_clone_swapped logic
The was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.

In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued.  A duplicate insertion would be attempted and
its failure would result in a panic.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9140 
Closes #9163
2019-09-18 09:04:45 -07:00
loli10K
fcd37b622b Device removal of indirect vdev panics the kernel
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9327
2019-09-16 10:46:59 -07:00
loli10K
b24771a8c9 Prevent gcc -Werror=maybe-uninitialized warnings in spa_wait_common()
This commit fixes the following build failure detected on Debian9
(GCC 6.3.0):

     CC [M]  module/zfs/spa.o
   module/zfs/spa.c: In function ‘spa_wait_common.part.31’:
   module/zfs/spa.c:9468:6: error: ‘in_progress’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
      if (!in_progress || spa->spa_waiters_cancel || error)
         ^
   cc1: all warnings being treated as errors

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9326
2019-09-16 10:46:02 -07:00
Tom Caputi
637f0c6019 Fix clone handling with encryption roots
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9267 
Closes #9294
2019-09-16 10:07:33 -07:00
loli10K
2a0d41889e Scrubbing root pools may deadlock on kernels without elevator_change() (#9321)
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices. At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel which prompted
us to add compatibly code to set the elevator via a usermodehelper.

Unfortunately changing the evelator via usermodehelper requires reading
some userland binaries, most notably modprobe(8) or sh(1), from a zfs
dataset on systems with root-on-zfs. This can deadlock the system if
used during the following call path because it may need, if the data
is not already cached in the ARC, reading directly from disk while
holding the spa config lock as a writer:

  zfs_ioc_pool_scan()
    -> spa_scan()
      -> spa_scan()
        -> vdev_reopen()
          -> vdev_elevator_switch()
            -> call_usermodehelper()

While the usermodehelper waits sh(1), modprobe(8) is blocked in the
ZIO pipeline trying to read from disk:

  INFO: task modprobe:2650 blocked for more than 10 seconds.
       Tainted: P           OE     5.2.14
  modprobe        D    0  2650    206 0x00000000
  Call Trace:
   ? __schedule+0x244/0x5f0
   schedule+0x2f/0xa0
   cv_wait_common+0x156/0x290 [spl]
   ? do_wait_intr_irq+0xb0/0xb0
   spa_config_enter+0x13b/0x1e0 [zfs]
   zio_vdev_io_start+0x51d/0x590 [zfs]
   ? tsd_get_by_thread+0x3b/0x80 [spl]
   zio_nowait+0x142/0x2f0 [zfs]
   arc_read+0xb2d/0x19d0 [zfs]
   ...
   zpl_iter_read+0xfa/0x170 [zfs]
   new_sync_read+0x124/0x1b0
   vfs_read+0x91/0x140
   ksys_read+0x59/0xd0
   do_syscall_64+0x4f/0x130
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This commit changes how we use the usermodehelper functionality from
synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents
scrubs, and other vdev_elevator_switch() consumers, from triggering the
aforementioned issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue #8664 
Closes #9321
2019-09-13 18:09:59 -07:00
John Gallagher
e60e158eff Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.

This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:

 - Scrubs or resilvers to complete
 - Devices to initialized
 - Devices to be replaced
 - Devices to be removed
 - Checkpoints to be discarded
 - Background freeing to complete

For example, a scrub that is in progress could be waited for by running

    zpool wait -t scrub <pool>

This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.

This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.

Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:

 - Added ZoL-style ioctl input declaration.
 - Reorganized error handling in zpool_initialize in libzfs to integrate
   better with changes made for TRIM support.
 - Fixed check for whether a checkpoint discard is in progress.
   Previously it also waited if the pool had a checkpoint, instead of
   just if a checkpoint was being discarded.
 - Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
 - Updated more existing tests to make use of new 'zpool wait'
   functionality, tests that don't exist in Delphix OS.
 - Used existing ZoL tunable zfs_scan_suspend_progress, together with
   zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
 - Added support for a non-integral interval argument to zpool wait.

Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-13 18:09:06 -07:00
Chengfei ZHu
7238cbd4d3 QAT related bug fixes
1. Fix issue:  Kernel BUG with QAT during decompression  #9276.
   Now it is uninterruptible for a specific given QAT request,
   but Ctrl-C interrupt still works in user-space process.

2. Copy the digest result to the buffer only when doing encryption,
   and vise-versa for decryption.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Closes #9276 
Closes #9303
2019-09-12 13:33:44 -07:00
Matthew Macy
b01a6574ae Move objnode handling to common code
objnode is OS agnostic and used only by dmu_redact.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9315
2019-09-12 13:31:09 -07:00
Matthew Macy
74756182d2 Enable compiler to typecheck logging
Annotate spa logging declarations with printflike
Workaround gcc bug (non disable-able warning) by
replacing "" with " "

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9316
2019-09-12 13:28:26 -07:00
Matthew Macy
d66620681d OpenZFS restructuring - move linux tracing code to platform directories
Move Linux specific tracing headers and source to platform directories
and update the build system.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9290
2019-09-11 14:25:53 -07:00
Tom Caputi
5815f7ac30 Fix stalled txg with repeated noop scans
Currently, the DSL scan code figures out when it should suspend
processing and allow a txg to continue by calling the function
dsl_scan_check_suspend(). Unfortunately, this function only
allows the scan to suspend at a level 0 block. In the event that
the system is scanning a bunch of empty snapshots or a resilver
is running with a high enough scn_cur_min_txg, the scan will
stop processing each dataset at the root level, deciding it
has nothing left to do. This means that the check_suspend
function is never called and the txg remains stuck until a
dataset is found that has data to scan.

This patch fixes the problem by allowing scans to suspend at
the root level of the objset. For backwards compatibility, we
use the bookmark <objsetid, 0, 0, 0> when we suspend here so
that older versions of the code will work as intended.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9300
2019-09-11 11:16:48 -07:00
Brian Behlendorf
25f06d677a
Fix /etc/hostid on root pool deadlock
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify().  A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool.  This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer.  For example, to perform a
`zpool attach` as shown in the abbreviated stack below.

To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property.  The cached
value is then relied upon for subsequent accesses.

Call Trace:
    spa_config_enter+0x1e8/0x350 [zfs]
    zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
    zio_read+0x6c/0x140 [zfs]
    ...
    vfs_read+0xfc/0x1e0
    kernel_read+0x50/0x90
    ...
    spa_get_hostid+0x1c/0x38 [zfs]
    spa_config_generate+0x1a0/0x610 [zfs]
    vdev_label_init+0xa0/0xc80 [zfs]
    vdev_create+0x98/0xe0 [zfs]
    spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9256 
Closes #9285
2019-09-10 13:42:30 -07:00
Brian Behlendorf
b88ca2acf5
Enable SIMD for encryption
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context.  This was done
intentionally since it's a relatively infrequent operation.  However,
this also meant that the encryption context templates were initialized
using the generic operations.  Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread.  And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context.  For example, when performing a read from the ARC.

Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9215
Closes #9296
2019-09-10 10:45:46 -07:00
Matthew Macy
bced7e3aaa OpenZFS restructuring - move platform specific sources
Move platform specific Linux source under module/os/linux/
and update the build system accordingly.  Additional code
restructuring will follow to make the common code fully
portable.
    
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Closes #9206
2019-09-06 11:26:26 -07:00
Matthew Macy
03fdcb9adc Make module tunables cross platform
Adds ZFS_MODULE_PARAM to abstract module parameter
setting to operating systems other than Linux.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9230
2019-09-05 14:49:49 -07:00
Serapheim Dimitropoulos
65a91b166e metaslab_verify_weight_and_frag() shouldn't cause side-effects
`metaslab_verify_weight_and_frag()` a verification function and
by the end of it there shouldn't be any side-effects.

The function calls `metaslab_weight()` which in turn calls
`metaslab_set_fragmentation()`. The latter can dirty and otherwise
not dirty metaslab fro the next TXGand set `metaslab_condense_wanted`
if the spacemaps were just upgraded (meaning we just enabled the
SPACEMAP_HISTOGRAM feature through upgrade).

This patch adds a new flag as a parameter to `metaslab_weight()` and
`metaslab_set_fragmentation()` making the dirtying of the metaslab
optional.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9185 
Closes #9282
2019-09-05 09:57:55 -07:00
Matthew Macy
006e9a4088 OpenZFS restructuring - move platform specific headers
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.

As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context.  No functional change, the source has been
refactored and the relevant #include's updated.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9198
2019-09-05 09:34:54 -07:00
Igor K
e242b67cee Fix panic on DilOS with kstat per dataset statistics
Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size.  This value
is ignored in the Linux kstat code but resolves the issue for other
platforms.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #9254 
Closes #9151
2019-09-03 12:12:31 -07:00
Andriy Gapon
ebeb6f23bf Always refuse receving non-resume stream when resume state exists
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).

Additionally, distinguish incremental and resume streams in error
messages.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9252
2019-09-03 10:56:55 -07:00
loli10K
6988f3ed9a Fix Intel QAT / ZFS compatibility on v4.7.1+ kernels
This change use the compat code introduced in 9cc1844a.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9268 
Closes #9269
2019-09-03 10:36:33 -07:00
George Wilson
1e52716257 maxinflight can overflow in spa_load_verify_cb()
When running on larger memory systems, we can overflow the value of
maxinflight. This can result in maxinflight having a value of 0 causing
the system to hang.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #9272
2019-09-02 19:17:51 -07:00
Andrea Gelmini
e1cfd73f7f Fix typos in module/zfs/
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9240
2019-09-02 17:56:41 -07:00
Andrea Gelmini
9f5c1bc609 Fix typos in module/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9241
2019-08-30 14:32:18 -07:00
Andrea Gelmini
9d40bdf414 Fix typos in modules/icp/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9239
2019-08-30 14:26:07 -07:00
Paul Dagnelie
475aa97cab Prevent metaslab_sync panic due to spa_final_dirty_txg
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9185
Closes #9186
Closes #9231
Closes #9253
2019-08-30 09:28:31 -07:00
Paul Dagnelie
eef0f4d84e Keep more metaslabs loaded
With the other metaslab changes loaded onto a system, we can 
significantly reduce the memory usage of each loaded metaslab and 
unload them on demand if there is memory pressure. However, none 
of those changes actually result in us keeping more metaslabs loaded. 
If we don't keep more metaslabs loaded, we will still have to wait 
for demand-loading to finish when no loaded metaslab can satisfy our 
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while 
unloading is currently done on a txg-count basis. Txgs can take a 
widely varying range of times, from tenths of a second to several 
seconds. This can result in confusing, hard to predict behavior.

This change simply adds a time-based component to metaslab unloading. 
A metaslab will remain loaded for one minute and 8 txgs (by default) 
after it was last used, unless it is evicted due to memory pressure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes #9197
2019-08-29 10:20:36 -07:00
Tony Nguyen
8d04284281 Use smaller default slack/delta value for schedule_hrtimeout_range()
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.

This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.

Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
  and smaller gains as thread count increases. At >64 threads, 2-5%
  decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
  leveling out around 64 threads. At >64 threads, 5-10% decrease in
  performance was observed.
- 128K sequential write sees 1-5 for the 128K. No observed regression at
  high thread count.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #9217
2019-08-28 14:56:54 -07:00
Don Brady
28c91ab66d Tag ABD pages for exclusion in kernel crash dumps
Tag the ABD data pages so that they can be identified for exclusion 
from kernel crash dumps. Eliminating the zfs file data allows for 
significantly smaller crash dump files. Note that ZFS in illumos has 
always excluded the zfs data pages from a kernel crash dump.

This change tags ARC scatter data pages so they can be identified from 
the makedumpfile(8) command. That command is used to create smaller 
dump files by ignoring some memory regions and using compression. It 
already filters file data from the VFS page cache and will now be able 
to exclude ZFS file data pages from the dump file.

A corresponding change to makeumpfile(8) is required to identify ZFS 
data pages.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8899
2019-08-28 10:44:46 -07:00
Chunwei Chen
035e96118b Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
2019-08-28 10:42:02 -07:00
Andriy Gapon
e6203d288a zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.

After this change, zfs_check_userprops() only validates the properties
without any permission checks.  The permissions are checked individually
for each snapshotted dataset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9179 
Closes #9180
2019-08-27 13:45:53 -07:00
Tom Caputi
e7a2fa70c3 Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
2019-08-27 09:55:51 -07:00
Tony Hutter
a9ebdfdd43 Linux 5.3: Fix switch() fall though compiler errors
Fix some switch() fall-though compiler errors:

    abd.c:1504:9: error: this statement may fall through

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9170
2019-08-21 09:29:23 -07:00
Matthew Ahrens
325d288c5d Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing.  With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.

By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes #8744
2019-08-20 11:34:52 -07:00
jdike
3beb0a7694 Fix lockdep circular locking false positive involving sa_lock
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
    zap_lockdir_impl+0x858/0x16c0 [zfs]
    zap_lockdir+0xd2/0x100 [zfs]
    zap_lookup_norm+0x7f/0x100 [zfs]
    zap_lookup+0x12/0x20 [zfs]
    sa_setup+0x902/0x1380 [zfs]
    zfsvfs_init+0x3d6/0xb20 [zfs]
    zfsvfs_create+0x5dd/0x900 [zfs]
    zfs_domount+0xa3/0xe20 [zfs]

and thread 2 trying to get sa_lock, either in sa_setup:
   sa_setup+0x742/0x1380 [zfs]
   zfsvfs_init+0x3d6/0xb20 [zfs]
   zfsvfs_create+0x5dd/0x900 [zfs]
   zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
   sa_build_index+0x13d/0x790 [zfs]
   sa_handle_get_from_db+0x368/0x500 [zfs]
   zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
   zfs_znode_alloc+0x3da/0x1a40 [zfs]
   zfs_zget+0x39a/0x6e0 [zfs]
   zfs_root+0x101/0x160 [zfs]
   zfs_domount+0x91f/0xea0 [zfs]

From there, there are different locking paths back to something
holding zap->zap_rwlock.

The deadlock scenarios involve multiple different ZFS filesystems
being mounted.  sa_lock is common to these scenarios, and the sa
struct involved is private to a mount.  Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.

The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #9110
2019-08-19 16:04:26 -07:00
Dominic Pearson
ff4b68eedc Linux 5.3 compat: Makefile subdir-m no longer supported
Uses obj-m instead, due to kernel changes.

See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Dominic Pearson <dsp@technoanimal.net>
Closes #9169
2019-08-19 15:22:52 -07:00
Paul Dagnelie
f09fda5071 Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
2019-08-16 09:08:21 -06:00
Serapheim Dimitropoulos
0f8ff49eb6 dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.

From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed].  Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:

[1] The waiter didn't call into any of the two functions - which I
    find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
    with?).
[2] The waiter did call in one of the above function but it passed 0 as
    the space/delta to be dirtied (or undirtied) and then the callee
    returned immediately (e.g both `dsl_pool_dirty_space()` and
    `dsl_pool_undirty_space()` return immediately when space is 0).

In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.

Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.

In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.

In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
 * We have written all of the accounted dirty data, so our
 * dp_space_towrite should now be zero.  However, some seldom-used
 * code paths do not adhere to this (e.g. dbuf_undirty(), also
 * rounding error in dbuf_write_physdone).
 * Shore up the accounting of any dirtied space now.
 */
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````

Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.

Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.

For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.

For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.

Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.

Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.

The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes #9137
2019-08-15 17:53:53 -06:00
Tony Nguyen
c8bbf7c00b Improve write performance by using dmu_read_by_dnode()
In zfs_log_write(), we can use dmu_read_by_dnode() rather than
dmu_read() thus avoiding unnecessary dnode_hold() calls.

We get a 2-5% performance gain for large sequential_writes tests, >=128K
writes to files with recordsize=8K.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #9156
2019-08-15 17:36:24 -06:00
Serapheim Dimitropoulos
0e37a0f4f3 Assert that a dnode's bonuslen never exceeds its recorded size
This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.

This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().

A side to this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8348
2019-08-15 08:44:57 -06:00
Paul Zuchowski
e2b31b58e8 Make txg_wait_synced conditional in zfsvfs_teardown
The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9115
2019-08-15 08:27:13 -06:00
Paul Dagnelie
dc04a8c757 Prevent race in blkptr_verify against device removal
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run 
into a race condition where that vdev is temporarily unavailable. This 
happens when a device removal operation and the old vdev_t has been 
removed from the array, but the new indirect vdev has not yet been 
inserted.

We hold the spa_config_lock while doing our sensitive verification. 
To ensure that we don't deadlock, we only grab the lock if we don't 
have config_writer held. In addition, I had to const the tags of the 
refcounts and the spa_config_lock arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9112
2019-08-13 21:24:43 -06:00
Chunwei Chen
8e556c5ebc Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
2019-08-13 21:21:27 -06:00
Allan Jude
d2a32912b9 Mark dsl_livelist_should_disable() static
This function is not used outside of dsl_dataset.c

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #9154
2019-08-13 21:16:23 -06:00
George Wilson
c8242a96ba spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity 
of the data and metadata. The amount it scans will depend on the 
import flags provided. On systems with small amounts of memory or 
when importing a pool from the crash kernel, it's possible for 
spa_load_verify to issue too many I/Os that it consumes all the memory 
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
2019-08-13 08:11:57 -06:00
Tomohiro Kusumi
a43570c5f3 Change boolean-like uint8_t fields in znode_t to boolean_t
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9092
2019-08-13 07:58:02 -06:00
Richard Yao
fccbd1d6e2 Drop KMC_NOEMERGENCY
This is not implemented. If it were implemented, using it would risk
deadlocks on pre-3.18 kernels. Lets just drop it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #9119
2019-08-13 07:46:12 -06:00
Serapheim Dimitropoulos
3b9edd7b17 Introduce getting holds and listing bookmarks through ZCP
Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor-refactoring was also
applied to distinguish between user and system properties
in ZCP.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com>

OpenZFS-issue: https://illumos.org/issues/8862
Closes #7902
2019-08-12 10:02:34 -07:00
Serapheim Dimitropoulos
2081db7982 Sort log spacemap tunables in alphabetical order
Beside the whole commit being a nit in reality it should
bring the diffs of the spa_log_spacemap.c source file
between ZoL and delphix/zfs to 0.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9143
2019-08-12 09:49:07 -07:00
Paul Dagnelie
c81f1790e2 Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-05 14:34:27 -07:00
DeHackEd
99e755d653 Don't wakeup unnecessarily in 'zpool events -f'
ZED can prevent CPU's from properly sleeping.

Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #9091
2019-08-05 11:35:47 -07:00
jdike
48be0dfba1 lockdep false positive - move txg_kick() outside of ->dp_lock
This fixes a lockdep warning by breaking a link between ->tx_sync_lock
and ->dp_lock.

The deadlock envisioned by lockdep is this:
    thread 1 holds db->db_mtx and tries to get dp->dp_lock:
	dsl_pool_dirty_space+0x70/0x2d0 [zfs]
	dbuf_dirty+0x778/0x31d0 [zfs]

    thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
        dmu_buf_will_dirty_impl
        dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
        bpobj_iterate_impl+0xbe6/0x1410 [zfs]

    thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
        bpobj_space+0x63/0x470 [zfs]
        dsl_scan_active+0x340/0x3d0 [zfs]
        txg_sync_thread+0x3f2/0x1370 [zfs]

    thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock
       txg_kick+0x61/0x420 [zfs]
       dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]

This patch is orginally from Brian Behlendorf and slightly simplified
by me.

It breaks this cycle in thread 4 by moving the call from
dsl_pool_need_dirty_delay to txg_kick outside the section controlled
by dp->dp_lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #9094
2019-07-31 14:53:39 -07:00
Serapheim Dimitropoulos
1ba4f3e7b4 9072 handle error of zap_cursor_retrieve() for log spacemap zap
In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve()
to return errors other than the expected ENOENT (e.g. when we are at
the end of the zap). Ensure that these error cases are handled
correctly by the import path.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9074
2019-07-30 13:20:01 -07:00
Serapheim Dimitropoulos
2fcf4481a6 mismerged log spacemap comment for metaslab_verify_weight_and_frag
When the log spacemap commit was merged in ZoL, the
metaslab_verify_unflushed_changes() debugging function
was deleted as the feature was pretty much stable by
then. Unfortunately though there was a reference to
it from a comment in metaslab_verify_weight_and_frag().

This patch deletes the reference and pastes that
comment as is.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9097
2019-07-30 10:13:44 -07:00
Matthew Ahrens
0eb8ba6ab6 Improve performance by using dmu_tx_hold_*_by_dnode()
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand.  This eliminates some calls to dnode_hold(), which can be
expensive.  This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.

We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.

This also removes some unnecessary casts that are in the area.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9081
2019-07-30 09:18:30 -07:00
Brian Behlendorf
adf495e239
Fix channel programs on s390x
When adapting the original sources for s390x the JMP_BUF_CNT was
mistakenly halved due to an incorrect assumption of the size of
a unsigned long.  They are 8 bytes for the s390x architecture.
Increase JMP_BUF_CNT accordingly.

Authored-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Colin Ian King <canonical.com>
Tested-by: Colin Ian King <canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8992
Closes #9080
2019-07-28 18:15:26 -07:00
Tomohiro Kusumi
9fb6abe5ad Implement secpolicy_vnode_setid_retain()
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.

https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9035 
Closes #9043
2019-07-26 13:52:30 -07:00
Sara Hartse
37f03da8ba Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
2019-07-26 10:54:14 -07:00
Tomohiro Kusumi
d274ac5460 Don't directly cast unsigned long to void*
Cast to uintptr_t first for portability on integer to/from pointer
conversion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9065
2019-07-25 11:59:20 -07:00
Matthew Ahrens
1ff46825e2 Replace zf_rwlock with a mutex
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.  Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9062
2019-07-25 11:57:58 -07:00
Tomohiro Kusumi
09276fde1c Fix module_param() type for zfs_read_chunk_size
zfs_read_chunk_size is unsigned long.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9051
2019-07-19 11:23:56 -07:00
Serapheim Dimitropoulos
7f31908913 Tricky semantics of ms_max_size in metaslab_should_allocate()
metaslab_should_allocate() is used in two places:
[1] When trying to select a metaslab to allocate from
[2] When trying to allocate from a metaslab

In [2] we always expect the metaslab to be loaded, and after
the refactoring of the log spacemap changes, whenever we load
a metaslab we set ms_max_size to the biggest range in the
ms_allocatable tree. Thus, when it is used in [2], if that
field is 0, it means that the metaslab doesn't have any
segments that can be used for allocations now (though it may
have some free space but that space can be in the freeing,
freed, or deferred trees).

In [1] a metaslab can be loaded or unloaded at which point 0
can either mean the metaslab doesn't have any space or the
metaslab is just not loaded thus we go ahead and try to make
an estimation based on its weight.

The issue here is when we call the above function for [2] and
the metaslab doesn't have any allocatable space, we still go
ahead and check its ms_weight which may be out of date because
we haven't ran metaslab_sync_done() yet. At that point we are
allowing an allocation to be attempted even though we know
there is no range that is allocatable.

This patch fixes this issue by explicitly checking if the
metaslab is loaded and if it is, the ms_max_size is used.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9045
2019-07-19 11:19:50 -07:00
Serapheim Dimitropoulos
43a8536260 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015 
Closes #9044
2019-07-18 13:02:33 -07:00
Serapheim Dimitropoulos
1c44a5c97f hdr_recl calls zthr_wakeup() on destroyed zthr
There exists a race condition were hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:

[1] hdr_recl() runs first and goes intro zthr_wakeup()
    because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
    that flag, destroying the zthr, and goes into
    buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
    and we blow up.

This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9047
2019-07-18 12:55:29 -07:00
Tomohiro Kusumi
f79121d114 Fix wrong comment on zcr_blksz_{min,max}
These aren't tunable; illumos has this comment fixed in
"3742 zfs comments need cleaner, more consistent style",
so sync with that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9052
2019-07-18 12:48:46 -07:00
Brian Behlendorf
d64dd3b62a Retire unused spl_{mutex,rwlock}_{init_fini}
These functions are unused and can be removed along
with the spl-mutex.c and spl-rwlock.c source files.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:13:53 -07:00
Brian Behlendorf
e7a99dab2b Linux 5.3 compat: retire rw_tryupgrade()
The Linux kernel's rwsem's have never provided an interface to
allow a reader to be upgraded to a writer.  Historically, this
functionality has been implemented by a SPL wrapper function.
However, this approach depends on internal knowledge of the
rw_semaphore and is therefore rather brittle.

Since the ZFS code must always be able to fallback to rw_exit()
and rw_enter() when an rw_tryupgrade() fails; this functionality
isn't critical.  Furthermore, the only potentially performance
sensitive consumer is dmu_zfetch() and no decrease in performance
was observed with this change applied.  See the PR comments for
additional testing details.

Therefore, it is being retired to make the build more robust and
to simplify the rwlock implementation.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:08:56 -07:00
Brian Behlendorf
041205afee Linux 5.3 compat: rw_semaphore owner
Commit https://github.com/torvalds/linux/commit/94a9717b updated the
rwsem's owner field to contain additional flags describing the rwsem's
state.  Rather then update the wrappers to mask out these bits, the
code no longer relies on the owner stored by the kernel.  This does
increase the size of a krwlock_t but it makes the implementation
less sensitive to future kernel changes.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:07:46 -07:00
jdike
a649768a17 Fix lockdep recursive locking false positive in dbuf_destroy
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #8984
2019-07-17 09:18:24 -07:00
Michael Niewöhner
5784c7c36e Add missing __GFP_HIGHMEM flag to vmalloc
Make use of __GFP_HIGHMEM flag in vmem_alloc, which is required for
some 32-bit systems to make use of full available memory.
While kernel versions >=4.12-rc1 add this flag implicitly, older
kernels do not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9031
2019-07-17 09:09:22 -07:00
Tomohiro Kusumi
c8802ba08d Use zfsctl_snapshot_hold() wrapper
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9039
2019-07-17 09:07:53 -07:00
Brian Behlendorf
8062b7686a
Minor style cleanup
Resolve an assortment of style inconsistencies including
use of white space, typos, capitalization, and line wrapping.
There is no functional change.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9030
2019-07-16 17:22:31 -07:00
Brian Behlendorf
3b03ff2276
Fix get_special_prop() build failure
The cast of the size_t returned by strlcpy() to a uint64_t by the
VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
is set.  This is due to the additional hardening.  Since the token
is expected to always fit in strval the VERIFY3U has been removed.
If somehow it doesn't, it will still be safely truncated.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8999
Closes #9020
2019-07-16 14:14:12 -07:00
Serapheim Dimitropoulos
93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
Tomohiro Kusumi
6993e01202 Drop redundant POSIX ACL check in zpl_init_acl()
ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(),
so no need to test again on POSIX ACL access.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9009
2019-07-15 16:26:52 -07:00
Brian Behlendorf
64f3d39ae4
Export dnode symbols
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994
Closes #9027
2019-07-15 16:11:55 -07:00
Tom Caputi
9949b856a0 Ensure dsl_destroy_head() decrypts objsets
This patch corrects a small issue where the dsl_destroy_head()
code that runs when the async_destroy feature is disabled would
not properly decrypt the dataset before beginning processing.
If the dataset is not able to be decrypted, the optimization
code now simply does not run and the dataset is completely
destroyed in the DSL sync task.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9021
2019-07-15 16:08:42 -07:00
Tomohiro Kusumi
ff9630d1a8 Disable unused pathname::pn_path* (unneeded in Linux)
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.

Saves stack a bit on lookup and unlink.

(#if0'd members instead of removing since comments refer to them.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9025
2019-07-15 13:57:56 -07:00
Brian Behlendorf
e5db313494
Linux 5.0 compat: SIMD compatibility
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels.  This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.

For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used.  All non-kernel
threads will restore the preserved user FPU state.

For 5.2 and latter kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers.  Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.

In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS.  The vast majority of SIMD operations are
already performed by the IO pipeline.  The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact.  The two most noteworthy cases are:

  1) Decrypting the wrapping key for an encrypted dataset,
     i.e. `zfs load-key`.  All other encryption and decryption
     operations will use the SIMD optimized implementations.

  2) Generating the payload checksums for a `zfs send` stream.

In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take in to
consideration the calling context.  This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).

The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time.  This code
was moved in to a taskq in order to accommodate the new kernel
thread restriction.

Finally, a few other modifications were made in order to further
harden this code and facilitate testing.  They include updating
each implementations operations structure to be declared as a
constant.  And allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8754 
Closes #8793 
Closes #8965
2019-07-12 09:31:20 -07:00
Nick Mattis
d230a65c3b Fixes: #8934 Large kmem_alloc
Large allocation over the spl_kmem_alloc_warn value was being performed.
Switched to vmem_alloc interface as specified for large allocations.
Changed the subsequent frees to match.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: nmattis <nickm970@gmail.com>
Closes #8934
Closes #9011
2019-07-10 15:54:49 -07:00
Paul Dagnelie
f664f1ee7f Decrease contention on dn_struct_rwlock
Currently, sequential async write workloads spend a lot of time 
contending on the dn_struct_rwlock. This lock is responsible for 
protecting the entire block tree below it; this naturally results 
in some serialization during heavy write workloads. This can be 
resolved by having per-dbuf locking, which will allow multiple 
writers in the same object at the same time.

We introduce a new rwlock, the db_rwlock. This lock is responsible 
for protecting the contents of the dbuf that it is a part of; when 
reading a block pointer from a dbuf, you hold the lock as a reader. 
When writing data to a dbuf, you hold it as a writer. This allows 
multiple threads to write to different parts of a file at the same 
time.

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens matt@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-52564
External-issue: DLPX-53085
External-issue: DLPX-57384
Closes #8946
2019-07-08 13:18:50 -07:00
Brad Lewis
cb70964221 8659 static dtrace probes unavailable on non-GPL modules
ZFS tracing efforts are hampered by the inability to access zfs static
probes(probes using DTRACE_PROBE macros). The probes are available via
tracepoints for GPL modules only.  The build could be modified to
generate a function for each unique DTRACE_PROBE invocation. These could
be then accessed via kprobes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>
Closes #8659 
Closes #8663
2019-07-08 11:20:53 -07:00
Brian Behlendorf
1086f54219 Revert "Fail early on bio corruption confirmed on 5.2-rc1"
This reverts commit aa7aab6c45.
The change is not compatible with CentOS 6's 2.6.32 based kernel
due to differnces in the bio layer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8961
2019-07-05 20:38:56 -07:00
Tom Caputi
52a83dc694 Remove VERIFY from dsl_dataset_crypt_stats()
This patch fixes an issue where dsl_dataset_crypt_stats() would
VERIFY that it was able to hold the encryption root. This function
should instead silently continue without populating the related
field in the nvlist, as is the convention for this code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8976
2019-07-05 16:53:14 -07:00
Paul Dagnelie
fe0ea84812 Don't activate metaslabs with weight 0
We return ENOSPC in metaslab_activate if the metaslab has weight 0, 
to avoid activating a metaslab with no space available.  For sanity 
checking, we also assert that there is no free space in the range 
tree in that case.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8968
2019-07-05 16:45:20 -07:00
Paul Zuchowski
6dbca94f0c Improve "Unable to automount" error message.
Having the mountpoint and dataset name both in the message made it
confusing to read.  Additionally, convert this to a zfs_dbgmsg rather than
sending it to the console.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8959
2019-07-03 13:05:02 -07:00
Tomohiro Kusumi
aa7aab6c45 Fail early on bio corruption confirmed on 5.2-rc1
Unable to import zpool with "Large kmem_alloc" warning due to
corrupted bio's with invalid # of page vectors.
See #8867 for details.

Fail early with ENOMEM.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8867 
Closes #8961
2019-07-03 13:03:22 -07:00
Brian Behlendorf
46db9d6161
Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional
The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY
is set.  Therefore, the EQUIV check must be wrapped accordingly.
For the same reason the ASSERT in arc_buf_fill() in unsafe.
However, since it's largely redundant it has simply been removed.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8979
2019-07-03 13:01:54 -07:00
Tomohiro Kusumi
df358db7c3 Don't use d_path() for automount mount point for chroot'd process
Chroot'd process fails to automount snapshots due to realpath(3)
failure in mount.zfs(8).

Construct a mount point path from sb of the ctldir inode and dirent
name, instead of from d_path(), so that chroot'd process doesn't get
affected by its view of fs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8903 
Closes #8966
2019-07-02 08:25:23 -07:00
George Wilson
681a85cb01 nopwrites on dmu_sync-ed blocks can result in a panic
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
   zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #8957
2019-06-28 12:40:23 -07:00
Paul Dagnelie
679b0f2abf Concurrent small allocation defeats large allocation
With the new parallel allocators scheme, there is a possibility for 
a problem where two threads, allocating from the same allocator at 
the same time, conflict with each other. There are two primary cases 
to worry about. First, another thread working on another allocator
activates the same metaslab that the first thread was trying to
activate. This results in the first thread needing to go back and
reselect a new metaslab, even though it may have waited a long time
for this metaslab to load. Second, another thread working on the same
allocator may have activated a different metaslab while the first
thread was waiting for its metaslab to load. Both of these cases
can cause the first thread to be significantly delayed in issuing 
its IOs. The second case can also cause metaslab load/unload churn; 
because the metaslab is loaded but not fully activated, we never set 
the selected_txg, which results in the metaslab being immediately 
unloaded again. This process can repeat many times, wasting disk and 
cpu resources. This is more likely to happen when the IO of the first 
thread is a larger one (like a ZIL write) and the other thread is 
doing a smaller write, because it is more likely to find an 
acceptable metaslab quickly.

There are two primary changes. The first is to always proceed with 
the allocation when returning from metaslab_activate if we were 
preempted in either of the ways described in the previous section. 
The second change is to set the selected_txg before we do the call 
to activate so that even if the metaslab is not used for an 
allocation, we won't immediately attempt to unload it.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-61314
Closes #8843
2019-06-26 11:00:12 -07:00
Alexander Motin
fc7546777b Avoid extra taskq_dispatch() calls by DMU
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8909
2019-06-25 12:03:38 -07:00
loli10K
746d4a451e Fix bp_embedded_type enum definition
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8951
2019-06-24 18:02:17 -07:00
Matthew Ahrens
59ec30a329 Remove code for zfs remap
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs.  This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.

Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8944
2019-06-24 16:44:01 -07:00
Tom Caputi
53864800f6 Fix error message on promoting encrypted dataset
This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8905 
Closes #8935
2019-06-24 16:42:52 -07:00
Brian Behlendorf
8f12a4f8d2
Fix out-of-tree build failures
Resolve the incorrect use of srcdir and builddir references for
various files in the build system.  These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.

With this change it's again possible to build in a subdirectory.

    $ mkdir obj
    $ cd obj
    $ ../configure
    $ make

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8921 
Closes #8943
2019-06-24 09:32:47 -07:00
Don Brady
186898bbb5 OpenZFS 9425 - channel programs can be interrupted
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.

Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete.  Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.

Porting notes: Added missing return value from cv_wait_sig()

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926
Closes #8904
2019-06-22 16:51:46 -07:00
Matthew Ahrens
cb9e5b7e84 dn_struct_rwlock can not be held in dmu_tx_try_assign()
The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below).  The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).

This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8929
2019-06-22 16:48:54 -07:00
Paul Dagnelie
a370182fed Add SCSI_PASSTHROUGH to zvols to enable UNMAP support
When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.

We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8933
2019-06-21 09:40:56 -07:00
Tomohiro Kusumi
9585497208 Prevent pointer to an out-of-scope local variable
`show_str` could be a pointer to a local variable in stack
which is out-of-scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8924 
Closes #8940
2019-06-20 18:31:52 -07:00
Matthew Ahrens
accd6d9dc4 dedup=verify doesn't clear the blkptr's dedup flag
The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-48097
Closes #8936
2019-06-20 18:30:40 -07:00
Igor K
a64f8276c7 Update vdev_ops_t from illumos
Align vdev_ops_t from illumos for better compatibility.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8925
2019-06-20 18:29:02 -07:00
Tom Caputi
da68988708 Allow unencrypted children of encrypted datasets
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8737 
Closes #8870
2019-06-20 12:29:51 -07:00
Matthew Ahrens
050d720c43 Remove dedupditto functionality
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x).  The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.

ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ).  Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful.  The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad.  This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.

Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.

Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).

In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection.  It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).

We should remove the dedupditto functionality.  For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect.  For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate.  However, ZFS won't write any new dedupditto
blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270 
Closes #8310
2019-06-19 14:54:02 -07:00
Paul Dagnelie
30af21b025 Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to 
a target system. One possible use case for this feature is to not 
transmit sensitive information to a data warehousing, test/dev, or 
analytics environment. Another is to save space by not replicating 
unimportant data within a given dataset, for example in backup tools 
like zrepl.

Redacted send/receive is a three-stage process. First, a clone (or 
clones) is made of the snapshot to be sent to the target. In this 
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction 
snapshot" (or snapshots). Second, the new zfs redact command is used 
to create a redaction bookmark. The redaction bookmark stores the 
list of blocks in a snapshot that were modified by the redaction 
snapshot(s). Finally, the redaction bookmark is passed as a parameter 
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive 
or unwanted information, and those blocks are not included in the send 
stream.  When sending from the redaction bookmark, the blocks it 
contains are considered as candidate blocks in addition to those 
blocks in the destination snapshot that were modified since the 
creation_txg of the redaction bookmark.  This step is necessary to 
allow the target to rehydrate data in the case where some blocks are 
accidentally or unnecessarily modified in the redaction snapshot.

The changes to bookmarks to enable fast space estimation involve 
adding deadlists to bookmarks. There is also logic to manage the 
life cycles of these deadlists.

The new size estimation process operates in cases where previously 
an accurate estimate could not be provided. In those cases, a send 
is performed where no data blocks are read, reducing the runtime 
significantly and providing a byte-accurate size estimate.

Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 09:48:12 -07:00
Alexander Motin
c1b5801bb5 Minimize aggsum_compare(&arc_size, arc_c) calls.
For busy ARC situation when arc_size close to arc_c is desired.  But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result.  Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.

I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8901
2019-06-14 14:07:33 -07:00
Matthew Ahrens
53dce5acc6 panic in removal_remap test on 4K devices
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes #8893
2019-06-13 13:12:39 -07:00
Matthew Ahrens
be89734a29 compress metadata in later sync passes
Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata).  Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).

However, in practice it increases the average number of sync passes,
because when we turn compression off, a lot of block's size will change
and thus we have to re-allocate (not overwrite) them.  It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed.  The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.

We should increase zfs_sync_pass_dont_compress.  In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-63431
Closes #8892
2019-06-13 13:10:18 -07:00
Alexander Motin
ae5c78e0b1 Move write aggregation memory copy out of vq_lock
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8890
2019-06-13 13:08:24 -07:00
Matthew Ahrens
d3230d761a looping in metaslab_block_picker impacts performance on fragmented pools
On fragmented pools with high-performance storage, the looping in
metaslab_block_picker() can become the performance-limiting bottleneck.
When looking for a larger block (e.g. a 128K block for the ZIL), we may
search through many free segments (up to hundreds of thousands) to find
one that is large enough to satisfy the allocation. This can take a long
time (up to dozens of ms), and is done while holding the ms_lock, which
other threads may spin waiting for.

When this performance problem is encountered, profiling will show
high CPU time in metaslab_block_picker, as well as in mutex_enter from
various callers.

The problem is very evident on a test system with a sync write workload
with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage,
84% full and 88% fragmented. It has also been observed on production
systems with 90TB of storage, 76% full and 87% fragmented.

The fix is to change metaslab_df_alloc() to search only up to 16MB from
the previous allocation (of this alignment). After that, we will pick a
segment that is of the exact size requested (or larger). This reduces
the number of iterations to a few hundred on fragmented pools (a ~100x
improvement).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-62324
Closes #8877
2019-06-13 13:06:15 -07:00
Tulsi Jain
9c7da9a95a Restrict filesystem creation if name referred either '.' or '..'
This change restricts filesystem creation if the given name
contains either '.' or '..'

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8842 
Closes #8564
2019-06-13 08:56:15 -07:00
Matthew Ahrens
3475724ea4 ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread()
When running zloop, we occasionally see the following crash:

    dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
    ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3]

The error value 0x1c is ENOSPC.

The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space. i.e. we should not call
dmu_tx_hold_space().  This will allow the removal thread to schedule its
work even when the pool is low on space.  The "slop space" will provide
enough free space to sync out the txg.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-37853
Closes #8889
2019-06-13 08:48:42 -07:00
Tomohiro Kusumi
daddbdc7cc Fix lockdep warning on insmod
sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed #8868 on Fedora 29 running
kernel-debug.

This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8868 
Closes #8884
2019-06-12 17:15:06 -07:00
Matthew Ahrens
d9b4bf0665 fat zap should prefetch when iterating
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes #8862
2019-06-12 13:13:09 -07:00
Matthew Ahrens
d9cd66e45f Target ARC size can get reduced to arc_c_min
Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance.  We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts which impacts how long it takes for system to reach
good IOPS performance.

We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.

However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").

To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59431
Closes #8864
2019-06-12 13:06:55 -07:00
bnjf
10269e02f9 Fix typo in vdev_raidz_math.c
Fix typo in vdev_raidz_math.c

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brad Forschinger <github@bnjf.id.au>
Closes #8875 
Closes #8880
2019-06-12 13:03:33 -07:00
Matthew Ahrens
5662fd5794 single-chunk scatter ABDs can be treated as linear
Scatter ABD's are allocated from a number of pages.  In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer.  Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer.  This can have a
measurable performance overhead on some workloads.

https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter.  For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path.  For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.

All single-page (4K) ABD's can be represented this way.  Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages).  This is often the case for 2-page (8K)
ABD's.

Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy.  A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.

Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space.  This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8580
2019-06-11 09:02:31 -07:00
Matthew Ahrens
b8738257c2 make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations.  The large allocations are for ZIL blocks.  If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more.  In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
2019-06-10 11:48:42 -07:00
Alexander Motin
5a902f5aaa Fix comparison signedness in arc_is_overflowing()
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Use
of signed comparison there fixes the problem.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Closes #8873
2019-06-10 09:52:25 -07:00
Tom Caputi
c08c30ed13 Fix incorrect error message for raw receive
This patch fixes an incorrect error message that comes up when
doing a non-forcing, raw, incremental receive into a dataset
that has a newer snapshot than the "from" snapshot. In this
case, the current code prints a confusing message about an IVset
guid mismatch.

This functionality is supported by non-raw receives as an
undocumented feature, but was never supported by the raw receive
code. If this is desired in the future, we can probably figure
out a way to make it work.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8758 
Closes #8863
2019-06-10 09:45:08 -07:00
Paul Dagnelie
893a6d62c1 Allow metaslab to be unloaded even when not freed from
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure, 
on heavily fragmented pools they can still consume a significant 
amount of memory. This problem is amplified when we fail to unload 
metaslabs that we aren't using. Currently, we only unload a metaslab 
during metaslab_sync_done; in order for that function to be called 
on a given metaslab in a given txg, we have to have dirtied that 
metaslab in that txg. If the dirtying was the result of an allocation, 
we wouldn't be unloading it (since it wouldn't be 8 txgs since it 
was selected), so in effect we only unload a metaslab during txgs 
where it's being freed from.

We move the unload logic from sync_done to a new function, and 
call that function on all metaslabs in a given vdev during 
vdev_sync_done().

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8837
2019-06-06 19:10:43 -07:00
Tom Caputi
fd7a657bff Reinstate raw receive check when truncating
This patch re-adds a check that was removed in 369aa50. The check
confirms that a raw receive is not occuring before truncating an
object's dn_maxblkid. At the time, it was believed that all cases
that would hit this code path would be handled in other places,
but that was not the case.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8852 
Closes #8857
2019-06-06 13:47:33 -07:00
Allan Jude
b7109a413c l2arc_apply_transforms: Fix typo in comment
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #8822
2019-06-06 13:14:48 -07:00
Serapheim Dimitropoulos
cb020f0d86 Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold
Historically while doing performance testing we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically in a
hypothetical pool with two vdevs, what can happen is the following:
Vdev A would go above that threshold and only vdev B would be used.
Then vdev B would pass that threshold but vdev A would go below it
(we've been freeing from A to allocate to B). The allocations would
go back and forth utilizing one vdev at a time with IOPS taking a hit.

Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. We effectively disable its effect in the majority of the
cases but we don't remove (at least for now) just in case we hit any
weird behavior in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8859
2019-06-06 13:08:41 -07:00
Tom Caputi
3ce85b5e60 Fix integer overflow of ZTOI(zp)->i_generation
The ZFS on-disk format stores each inode's generation ID as a 64
bit number on disk and in-core. However, the Linux kernel's inode
is only a 32 bit number. In most places, the code handles this
correctly, but the cast is missing in zfs_rezget(). For many pools,
this isn't an issue since the generation ID is computed as the
current txg when the inode is created and many pools don't have
more than 2^32 txgs.

For the pools that have more txgs, this issue causes any inode with
a high enough generation number to report IO errors after a call to
"zfs rollback" while holding the file or directory open. This patch
simply adds the missing cast.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8858
2019-06-06 12:59:39 -07:00
Tomohiro Kusumi
1a6889947e Drop objid argument in zfs_znode_alloc() (sync with OpenZFS)
Since zfs_znode_alloc() already takes dmu_buf_t*, taking another
uint64_t argument for objid is redundant. inode's ->i_ino does and
needs to match znode's ->z_id.

zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument
since vnode doesn't have vnode# in VFS (hence ->z_id exists).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8841
2019-06-05 14:18:46 -07:00
DeHackEd
df24bcf00a Wait in 'S' state when send/recv pipe is blocking
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8733 
Closes #8752
2019-06-03 20:54:43 -07:00
TulsiJain
a3c98d5728 Make zfs_async_block_max_blocks handle zero correctly
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8829
Closes #8289
2019-06-03 09:40:48 -07:00
Brian Behlendorf
2531ce3720
Revert "Report holes when there are only metadata changes"
This reverts commit ec4f9b8f30 which introduced a narrow race which
can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO.  Resolve
the issue by revering this change to restore the previous behavior
which depends solely on checking the dirty list.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8816 
Closes #8834
2019-05-30 17:13:18 -07:00
Tomohiro Kusumi
fe0c9f409a Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.

The rest of initialization functions being called here after cwd set
to / don't depend on cwd of the process except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8826
2019-05-29 16:18:14 -07:00
Brian Behlendorf
1e724f4f34
Exclude log device ashift from normal class
When opening a log device during import its allocation bias will
not yet have been set by vdev_load().  This results in the log
device's ashift being incorrectly applied to the maximum ashift
of the vdevs in the normal class.  Which in turn prevents the
removal of any top-level devices due to the ashift check in the
spa_vdev_remove_top_check() function.

This issue is resolved by including vdev_islog in the check since
it will be set correctly during vdev_open().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8735
2019-05-29 11:35:50 -07:00
madz
ec4afd27f1 Fix integer overflow in get_next_chunk()
dn->dn_datablksz type is uint32_t and need to be casted to uint64_t
to avoid an overflow when the record size is greater than 4 MiB.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr>
Closes #8778 
Closes #8797
2019-05-29 10:17:25 -07:00
loli10K
ab1a9705f8 Double-free of encryption wrapping key due to invalid pool properties
This commits fixes a double-free in zfs_ioc_pool_create() triggered by
specifying an unsupported combination of properties when creating a pool
with encryption enabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8791
2019-05-28 15:19:50 -07:00
Tomohiro Kusumi
841a7a98fc Update descriptions for vnops
These descriptions are not uptodate with the code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8767
2019-05-25 14:29:10 -07:00
Tom Caputi
3b61ca3e57 Fix embedded bp accounting in count_block()
Currently, count_block() does not correctly account for the
possibility that the bp that is passed to it could be embedded.
These blocks shouldn't be counted since the work of scanning
these blocks in already handled when the containing block is
scanned. This patch simply resolves this issue by returning
early in this case.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8800 
Closes #8766
2019-05-25 13:52:23 -07:00
Tomohiro Kusumi
bfd5a709e7 Linux 5.2 compat: Directly call wait_on_page_bit()
wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd.

Directly call wait_on_page_bit() without using wait_on_page_writeback()
interface, given zfs_putpage() is the only caller for now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8794
2019-05-25 13:42:09 -07:00
Tomohiro Kusumi
4bb17ebfe2 Linux 5.2 compat: Remove config/kernel-set-fs-pwd.m4
This failed on 5.2-rc1 with "error: unknown" message, for set_fs_pwd()
not being visible in both const and non-const tests.

This is caused by torvalds/linux@83da1bed86. It's configurable,
but we would want to be able to compile with default kbuild setting.

set_fs_pwd() has never been exported with exception of some distro
kernels, and set_fs_pwd() wasn't used in ZoL to begin with. The test
result was used for a spl function vn_set_fs_pwd().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8777
2019-05-25 13:28:55 -07:00
Tomohiro Kusumi
c3e5907fdb Drop local definition of MOUNT_BUSY
It's accessible via <sys/mntent.h>.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8765
2019-05-24 16:43:23 -07:00
loli10K
2e8c315fc6 Device removal panics on 32-bit systems
The issue is caused by an incorrect usage of the sizeof() operator
in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue
since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on
32-bit systems pointers are 4 bytes long which is not supported by
zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system
will cause the following failure:

VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22)
PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete()
Showing stack for process 1315
CPU: 6 PID: 1315 Comm: txg_sync Tainted: P           OE   4.4.69+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0
 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341
 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76
Call Trace:
 [<>] dump_stack+0x58/0x7c
 [<>] spl_dumpstack+0x23/0x27 [spl]
 [<>] spl_panic.cold.0+0x5/0x41 [spl]
 [<>] ? dbuf_rele+0x3e/0x90 [zfs]
 [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs]
 [<>] ? zap_lookup+0x57/0x70 [zfs]
 [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs]
 [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs]
 [<>] ? txg_verify+0x1d/0x160 [zfs]
 [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs]
 [<>] vdev_sync+0xbf/0x550 [zfs]
 [<>] ? mutex_lock+0x10/0x30
 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs]
 [<>] ? zap_contains+0x4d/0x70 [zfs]
 [<>] spa_sync+0x9f1/0x1b10 [zfs]
 ...
 [<>] ? kthread_stop+0x110/0x110

This commit simply corrects the "integer_size" parameter used to lookup
the vdev's ZAP object.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8790
2019-05-24 12:17:52 -07:00
loli10K
75c09c5060 Fix coverity defects: CID 186143
CID 186143: Memory - illegal accesses (USE_AFTER_FREE)

This patch fixes an use-after-free in spa_import_progress_destroy()
moving the kmem_free() call at the end of the function.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8788
2019-05-23 19:17:00 -07:00
Rafael Kitover
8b8b44d06f kernel timer API rework
In `config/kernel-timer.m4` refactor slightly to check more generally
for the new `timer_setup()` APIs, but also check the callback signature
because some kernels (notably 4.14) have the new `timer_setup()` API but
use the old callback signature. Also add a check for a `flags` member in
`struct timer_list`, which was added in 4.1-rc8.

Add compatibility shims to `include/spl/sys/timer.h` to allow using the
new timer APIs with the only two caveats being that the callback
argument type must be declared as `spl_timer_list_t` and an explicit
assignment is required to get the timer variable for the `timer_of()`
macro. So the callback would look like this:

```c
__cv_wakeup(spl_timer_list_t t)
{
        struct timer_list *tmr = (struct timer_list *)t;
	struct thing *parent = from_timer(parent, tmr,
		parent_timer_field);
	... /* do stuff with parent */
```

Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the
new timer APIs instead of conditional code.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #8647
2019-05-23 14:40:28 -07:00
Richard Elling
78fac8d925 Fix kstat state update during pool transition
When reading kstats, the health (aka state) of the pool is stored into
/proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name().
However, during import/export there is a case where the spa exists,
but the root vdev does not exist. This fix checks that case and sets
the state to "TRANSITIONING"

Unfortunately, it is not easy to reproduce a test for this. It was
detected randomly during ZTS runs while kstats were also being sampled
regularly. After this change, further testing did not trip on the case
and the TRANSITIONING state was collected at least once by the kstats.

For posterity, the backtrace prior to this fix is:
[Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs]
...
Mon May 13 17:21:00 2019] Call Trace:
[Mon May 13 17:21:00 2019]  spa_state_data+0x1a/0x40 [zfs]
[Mon May 13 17:21:00 2019]  kstat_seq_show+0x117/0x440 [spl]
[Mon May 13 17:21:00 2019]  seq_read+0xe5/0x430
[Mon May 13 17:21:00 2019]  proc_reg_read+0x45/0x70
[Mon May 13 17:21:00 2019]  __vfs_read+0x1b/0x40
[Mon May 13 17:21:00 2019]  vfs_read+0x8e/0x130
[Mon May 13 17:21:00 2019]  SyS_read+0x55/0xc0
[Mon May 13 17:21:00 2019]  ? SyS_fcntl+0x5d/0xb0
[Mon May 13 17:21:00 2019]  do_syscall_64+0x73/0x130
[Mon May 13 17:21:00 2019]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8746
2019-05-23 14:28:53 -07:00
Brian Behlendorf
bff2361aeb
Linux 5.2 compat: rw_tryupgrade()
Commit torvalds/linux@46ad0840b has removed the architecture specific
rwsem source and headers leaving only the generic version.  As part
of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS
macros were moved to the private kernel/locking/rwsem.h header.
This results in a build failure because these macros were required
to implement the rw_tryupgrade() compatibility function.

In practice, this isn't a major problem because there are only a
few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade
should be written to retry using rw_enter(RW_WRITER).

After auditing all of the callers only dmu_zfetch() was determined
not to perform a retry.  It has been updated in this commit to
resolve this issue.

That said, the rw_tryupgrade() functionality should be considered
for possible removal in a future release due to the difficultly
in supporting the interface.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8730
2019-05-23 13:46:33 -07:00
Paul Dagnelie
e61b53475e Fix incorrect assertion in dnode_dirty_l1range
The db_dirtycnt of an EVICTING dbuf is always 0. However, it still 
appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a 
range that contains an EVICTING dbuf, we will attempt to mark it dirty 
(which will fail because it's EVICTING, resulting in a new dbuf being 
created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the 
dbufs in the range are dirty. If the EVICTING dbuf is still present, 
this will trip the assertion erroneously.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8745
2019-05-19 17:30:33 -07:00
Olaf Faaland
ca95f70dff zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time.  The user may
not know why, or be able to tell whether the import is progressing or is
hung.

Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one).  The kstat is /proc/spl/kstat/zfs/import_progress.

The kstat contents are as follows:
pool_guid         load_state multihost_secs  max_txg pool_name
16667015954387398 3          15              0       tank3

load_state: the value of spa_load_state
multihost_secs:  seconds until the end of the multihost activity
                 check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring

This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting.  The
zpool import subcommand could also be modified to report this
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 10:08:05 -07:00
Tomohiro Kusumi
b689de85e8 Add missing trailing '\n' in printk() messages
These messages will want '\n' like any other regular printk() messages.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8726
2019-05-08 16:43:55 -07:00
Tomohiro Kusumi
97aa3ba44f Fix link count of root inode when snapdir is visible
Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by
->getattr() of zpl inodes) also needs to consider an additional link
count if "snapdir" property is set to "visible".

Without this, # of directories in root inode of each dataset doesn't
match the link count when snapdir is visible.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8727
2019-05-08 16:40:51 -07:00
Brian Behlendorf
a20f43b51b
Linux 5.0 compat: ASM_BUG macro
The 5.0 kernel defines the macro ASM_BUG.  In order to prevent a
conflict and build failure rename ASM_BUG to ZFS_ASM_BUG.  This
is currently only an issue on aarch64 but all instances of
ASM_BUG we're renamed to avoid any future conflict on x86_64.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8725 
Issue #8545
2019-05-08 10:18:40 -07:00
Brian Behlendorf
515ddf6504
Fix errant EFAULT during writes (#8719)
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().

Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the  structure.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8640 
Closes #8719
2019-05-08 10:04:04 -07:00
DeHackEd
1f02ecc5a5 Make zfs_special_class_metadata_reserve_pct into a parameter
Exported and documented a new module parameter.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8706
2019-05-07 15:34:42 -07:00
Brian Behlendorf
caf9dd209f
Fix send/recv lost spill block
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object.  It may need to be removed or kept depending on how the
object was modified at the source.

This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties.  This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.

What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose.  When set then
the object references a spill block and it must be kept.  Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record.  Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.

Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible.  Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed.  Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.

The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS.  Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8668
2019-05-07 15:18:44 -07:00
Tomohiro Kusumi
9c53e51616 Fix zfs set atime|relatime=off|on behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.

(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)

--
 # zpool create p1 <device>
 # zfs create p1/f1
 # zfs set atime=off p1
 # echo test > /p1/f1/test
 # sync
 # zfs list
 NAME    USED  AVAIL     REFER  MOUNTPOINT
 p1      176K  18.9G     25.5K  /p1
 p1/f1    26K  18.9G       26K  /p1/f1
 # zfs get atime
 NAME   PROPERTY  VALUE  SOURCE
 p1     atime     off    local
 p1/f1  atime     off    inherited from p1
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:33.741205192 +0900
 # cat /p1/f1/test
 test
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:50.173231861 +0900
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--

The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.

Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.

Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.

The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674 
Closes #8675
2019-05-07 10:06:30 -07:00
Tomohiro Kusumi
a762893269 Fix typo/etc in module/zfs/zfs_ctldir.c
Drop duplicated phrases in comments.

Also drop an obsolete comment "Perform a mount of the associated...",
as all it does now is get objid from DMU and lookup incore inode.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8707
2019-05-05 15:57:23 -07:00
Tomohiro Kusumi
de3e0b914b Linux 5.0 compat: Use totalhigh_pages()
Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839
"mm: convert totalram_pages and totalhigh_pages variables to atomic"

replaced `totalhigh_pages` with an inline function `totalhigh_pages()`.
This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages`
on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8677
Closes #8701
2019-05-04 16:40:48 -07:00
John Gallagher
1eacf2b3b0 Improve rate at which new zvols are processed
The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes this processing new zvols slow.

Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.

This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().

Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #8526 
Closes #8615
2019-05-04 16:39:10 -07:00
Matthew Ahrens
8d9f616511 Reword comment in lz4_compress_zfs
The comment in lz4_compress_zfs could be more clear and specific.  It
also contains needlessly strong language.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes: #8702
Closes: #8703
2019-05-02 16:46:04 -07:00
Tom Caputi
fa24166074 Add feature check for 'zpool resilver' command
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8700
2019-05-02 16:42:31 -07:00
Tomohiro Kusumi
f0ce0436aa Correct snprintf() size argument
The size argument of snprintf(3) in glibc and snprintf() in Linux
kernel includes trailing \0, as snprintf(3) man page explains it as
"write at most size bytes (including the trailing null byte ('\0'))",
i.e. snprintf() can just take buffer size.

e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is
MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size
should just be `path_len` to do what the caller is trying to do.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8692
2019-04-30 19:41:12 -07:00
Brian Behlendorf
c12ea77865
Linux 5.0 compat: Remove incorrect ASSERT
Not all block devices, notably scsi_debug, set a root_blkg on the
request queue.  Remove this assertion and allow the the existing
call to blkg_tryget() to gracefully handle the NULL (which it does).

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8678
2019-04-29 18:20:21 -07:00
Tomohiro Kusumi
b43a27f76f Use NV_ENCODE_NATIVE for nvlist encoding variable
Use NV_ENCODE_NATIVE for nvlist encoding variable instead of 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8653
2019-04-26 11:24:31 -07:00
Tomohiro Kusumi
126d0fa733 Use SEEK_{SET,CUR,END} for file seek "whence"
Use either SEEK_* or 0,1,2..., but not both.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8656
2019-04-25 10:17:27 -07:00
Tom Caputi
f4c594da94 Fixes for the DMU free throttle
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of effected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.

The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8655
2019-04-25 10:16:24 -07:00
Tomohiro Kusumi
34d343c2a8 Drop unused ZNODE_STATS and ZNODE_STAT_ADD()
Unused since 5649246dd3("Remove znode move functionality"),
and ZNODE_STAT_ADD() will never be needed.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8636
2019-04-19 12:05:15 -07:00
Tomohiro Kusumi
a35c12073c Fix incorrect "[UNUSED]" comments
These aren't unused.
`flag` in zfs_create() also isn't to indicate large file.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8635
2019-04-19 12:03:32 -07:00
cfzhu
5090f72743 Code improvement and bug fixes for QAT support
1. Support QAT when ZFS is root file-system:
   When ZFS module is loaded before QAT started, the QAT can
   be started again in post-process, e.g.:
   echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
2. Verify alder checksum of the de-compress result
3. Allocate Digest, IV and AAD buffer in physical contiguous
   memory by QAT_PHYS_CONTIG_ALLOC.
4. Update the documentation for zfs_qat_compress_disable,
   zfs_qat_checksum_disable, zfs_qat_encrypt_disable.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Closes #8323 
Closes #8610
2019-04-16 12:38:36 -07:00
Richard Laager
59f6594cf6 Restructure vec_idx loops
This replaces empty for loops with while loops to make the code easier
to read.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: github.com/dcb314
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #6681
Closes #6682
Closes #6683
Closes #8623
2019-04-16 12:34:06 -07:00
Richard Laager
fcf21f8fcb Update a comment to match the code
GRUB supports large_blocks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:33 -07:00
Richard Laager
393363c5ec Consistently captialize GUID for features
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:51 -07:00
Richard Laager
612c4930dd Fix the spelling of deferred
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:45 -07:00
Tom Caputi
c2c6eadf29 Fix issues with truncated files in raw sends
When receiving a raw send stream only reallocated objects
whose contents were not freed by the standard indicators
should call dmu_free_long_range().

Furthermore, if calling dmu_free_long_range() is required
then the objects current block size must be used and not
the new block size.

Two additional test cases were added to provided realistic
test coverage for processing reallocated objects which are
part of a raw receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8528
Closes #8607
2019-04-15 15:28:48 -07:00
Tom Caputi
7dcd318832 Cleanup nits from ab7615d92
This patch simply up cleans up a nit and corrects an error message
issue that were introduced in the Multiple DVA scrub patch.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8619
2019-04-14 11:03:06 -07:00
Brian Behlendorf
b92f5d9f82
Fix issue in receive_object() during reallocation
When receiving an object to a previously allocated interior slot
the new object should be "allocated" by setting DMU_NEW_OBJECT,
not "reallocated" with dnode_reallocate().  For resilience verify
the slot is free as required in case the stream is malformed.

Add a test case to generate more realistic incremental send streams
that force reallocation to occur during the receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8067 
Closes #8614
2019-04-12 14:28:04 -07:00
Brian Behlendorf
3fa93bb8d3
Fix TXG_MASK cstyle
Fix style issue for 'tx->tx_txg&TXG_MASK'.  There should be white
space around the '&' character.  Split the dnode_reallocate() ASSERT
to make it more readable to clearly separate the checks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8606
2019-04-12 11:30:59 -07:00
Jorgen Lundman
48ed0f9da0 Always call rw_init in zio_crypt_key_unwrap
The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which
calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized.

We move the rw_init() call to the start of zio_crypt_key_unwrap instead.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8604
Closes #8605
2019-04-10 15:39:40 -07:00
Tim Chase
8cb34421e0 Avoid stack overwrite in zfs_setattr_dir()
The bulk[] array index, count, must be reset per-iteration in order to
not overwrite the stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8072
Closes #8597
Closes #8601
2019-04-10 15:38:21 -07:00
Tomohiro Kusumi
9a65234c8b Unbreak build on Linux kernel < 3.10
d12614521a("Fixes for procfs files backed by linked lists")
uses PDE_DATA(), but since PDE_DATA() (public interface which
replaced old public interface PDE()) first appeared in upstream
kernel 3.10, it lacks visible local definition for kernel < 3.10.

Move the local PDE_DATA() definition to a ZoL header, to unbreak
build on kernel < 3.10.

--
module/spl/spl-procfs-list.c: In function 'procfs_list_open':
module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA'
module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8599
2019-04-08 14:59:24 -07:00
Brian Behlendorf
d93d4b1acd
Revert "Fix issues with truncated files in raw sends"
This partially reverts commit 5dbf8b4ed.  This change resolved
the issues observed with truncated files in raw sends.  However,
the required changes to dnode_allocate() introduced a regression
for non-raw streams which needs to be understood.

The additional debugging improvements from the original patch
were not reverted.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7378
Issue #8528
Issue #8540
Issue #8565
Close #8584
2019-04-05 17:32:56 -07:00
Matthew Ahrens
944a37248a predictive prefetch disabled on new pools until export/reboot
When a pool is initially created (by `zpool create`), predictive
prefetch is inadvertently disabled, until the pool is export/import-ed,
or the machine is rebooted.

When device removal was introduced, we added some code to disable
predictive prefetching until indirect vdevs have been loaded.  This
resulted in the "default state" of prefetch being disabled, until we
proactively enable it after indirect vdevs are loaded.  Unfortunately
this resulted in a few bugs where in some code paths we neglect to
enable predictive prefetch.  The first of these was fixed by
20507534d4

This commit fixes another case where we also need to explicitly enable
predictive prefetch, when the pool is initially created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8577
2019-04-05 10:12:02 -07:00
Don Brady
b4ddec7af6 features.kernel layout should match features.pool
The features.kernel layout should match features.pool.

Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8566
2019-04-04 19:00:55 -07:00
Sara Hartse
a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Brian Behlendorf
f4e35b165c
Fix txg_wait_open() load average inflation
Callers of txg_wait_open() which set should_quiesce=B_TRUE should be
accounted for as iowait time.  Otherwise, the caller is understood
to be idle and cv_wait_sig() is used to prevent incorrectly inflating
the system load average.

Similarly txg_wait_wait() has been updated to use cv_wait_io() to
be accounted against iowait.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8550
Closes #8558
2019-04-04 09:44:46 -07:00
Brian Behlendorf
1b939560be
Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00
Tom Caputi
5dbf8b4edd Fix issues with truncated files in raw sends
This patch fixes a few issues with raw receives involving
truncated files:

* dnode_reallocate() now calls dnode_set_blksz() instead of
  dnode_setdblksz(). This ensures that any remaining dbufs with
  blkid 0 are resized along with their containing dnode upon
  reallocation.

* One of the calls to dmu_free_long_range() in receive_object()
  needs to check that the object it is about to free some contents
  or hasn't been completely removed already by a previous call to
  dmu_free_long_object() in the same function.

* The same call to dmu_free_long_range() in the previous point
  needs to ensure it uses the object's current block size and
  not the new block size. This ensures the blocks of the object
  that are supposed to be freed are completely removed and not
  simply partially zeroed out.

This patch also adds handling for DRR_OBJECT_RANGE records to
dprintf_drr() for debugging purposes.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7378 
Closes #8528
2019-03-27 11:30:48 -07:00
Igor K
c048ddaf33 Fix vd_path and error in spa_vdev_remove()
Make a local copy of the vd_path and preserve the removal error
for use in spa_history_log_internal().  This is required because
after spa_vdev_exit() there is nothing preventing the vdev state
from changing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8522
2019-03-22 13:25:07 -07:00
Roman Strashkin
234234ca4d Panic when running 'zpool split'
Added missing remove of detachable VDEV from txg's DTL list
to avoid use-after-free for the split VDEV

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes #5565 
Closes #7856
2019-03-22 13:11:36 -07:00
George Wilson
2efea7c82c ZFS Reads may result in unneccesary calls to zil_commit
ZFS supports O_RSYNC for read operations and when specified will ensure
the same level of data integrity that O_DSYNC and O_SYNC provides for
writes. O_RSYNC by itself has no effect so it must be combined with
either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC
and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect
and causes unnecessary calls to zil_commit. Only platforms which
support O_RSYNC should implement the zil_commit functionality in the
read code path.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8523
2019-03-22 13:09:11 -07:00
Olaf Faaland
060f0226e6 MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 12:47:57 -07:00
Jorgen Lundman
d10b2f1d35 Mutex leak in dsl_dataset_hold_obj()
In addition to dsl_dataset_evict_async() releasing a hold, there is
an error case in dsl_dataset_hold_obj() which had missed 4 additional
release calls.  This was introduced in a1d477c24.

openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c

Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8517
2019-03-21 10:36:58 -07:00
cfzhu
45001b949c QAT: Allocate digest_buffer using QAT_PHYS_CONTIG_ALLOC()
If the buffer 'digest_buffer' is allocated in the qat_checksum()
stack, it can't ensure that the address is physically contiguous,
and the DMA result of the buffer may be handled incorrectly.
Using QAT_PHYS_CONTIG_ALLOC() ensures a physically
contiguous allocation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com>
Closes #8323 
Closes #8521
2019-03-21 10:35:18 -07:00
Brian Behlendorf
ec4f9b8f30
Report holes when there are only metadata changes
Update the dirty check in dmu_offset_next() such that dnode's
are only considered dirty for the purpose or reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime) that holes are reported.

Reviewed-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6958 
Closes #8505
2019-03-21 10:30:15 -07:00
Julian Heuking
304d469dcd Add missing dmu_zfetch_fini() in dnode_move_impl()
As it turns out, on the Windows platform when rw_init() is called
(rather its bedrock call ExInitializeResourceLite) it is placed on
an active-list of locks, and is removed at rw_destroy() time.

dnode_move() has logic to copy over the old-dnode to new-dnode,
including calling dmu_zfetch_init(new-dnode). But due to the missing
dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the
memory (and in debug builds fill pattern 0xdeadbeef) over the Windows
active-lock's prev/next list pointers, making Windows sad.

But on other platforms, the contents of dmu_zfetch_fini() is one
call to list_destroy() and one to rw_destroy(), which is effectively
a no-op call and is not required. This commit is mostly for
"correctness" and can be skipped there.

Porting Notes:
* This leak exists on Linux but currently can never happen because
  the dnode_move() functionality is not supported.

openzfsonosx-commit: openzfsonosx/zfs@d95fe517

Authored by: Julian Heuking <JulianH@beckhoff.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8519
2019-03-20 15:06:55 -07:00
Brian Behlendorf
ca6c7a94c9
Fix l2arc_evict() destroy race
When destroying an arc_buf_hdr_t its identity cannot be discarded
until it is entirely undiscoverable.  This not only includes being
unhashed, but also being removed from the l2arc header list.
Discarding the header's identify prematurely renders the hash
lock useless because it will always hash to bucket zero.

This change resolves a race with l2arc_evict() by discarding the
identity after it has been removed from the l2arc header list.
This ensures either the header is not on the list or contains
the correct identify.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7688 
Closes #8144
2019-03-15 14:17:38 -07:00
Tom Caputi
ab7615d92c Multiple DVA Scrubbing Fix
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.

This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.

To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8453
2019-03-15 14:14:31 -07:00
Tony Hutter
2bbec1c910 Make zpool status counters match error events count
The number of IO and checksum events should match the number of errors
seen in zpool status.  Previously there was a mismatch between the
two counts because zpool status would only count unrecovered errors,
while zpool events would get an event for *all* errors (recovered or
not).  This lead to situations where disks could be faulted for
"too many errors", while at the same time showing zero errors in zpool
status.

This fixes the zpool status error counters to increment at the same
times we post the error events.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4851 
Closes #7817
2019-03-14 18:21:53 -07:00
Tom Caputi
04a3b0796c Fix memory leaks in zfsvfs_create_impl()
This patch simply fixes some small memory leaks that can happen
during error handling in zfsvfs_create_impl(). If the function
fails, it frees all the memory / references it created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8490
2019-03-14 18:14:36 -07:00
Tom Caputi
eaed840542 Better user experience for errata 4
This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.

* The errata warning has been made less scary for users without
  any encrypted datasets.

* The errata warning now clears itself without a pool reimport if
  the bookmark_v2 feature is enabled and no encrypted datasets
  exist.

* It is no longer possible to create new encrypted datasets without
  enabling the bookmark_v2 feature, thus helping to ensure that the
  errata is resolved.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue ##8308
Closes #8504
2019-03-14 16:48:30 -07:00
Alexander Motin
1af240f3b5 Add separate aggregation limit for non-rotating media
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, supposedly to reduce number of
head seeks for spinning disks.  But for SSDs it makes little to no
sense, especially on FreeBSD, where due to MAXPHYS limitation device
will likely still see bunch of 128KB I/Os instead of one large.
Having more strict aggregation limit for SSDs allows to avoid
allocation of large memory buffer and copy to/from it, that is a
serious problem when throughput reaches gigabytes per second.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Closes #8494
2019-03-13 12:00:10 -07:00
Andrew Stormont
1814242379 OpenZFS 9914 - NV_UNIQUE_NAME_TYPE broken after 9580
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9914
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b8a5bee18
Closes #8496
2019-03-13 11:16:30 -07:00
Tom Caputi
f00ab3f22c Detect and prevent mixed raw and non-raw sends
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.

This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8308
2019-03-13 11:00:43 -07:00
Tom Caputi
579ce7c5ae Add bookmark v2 on-disk feature
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix that for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
2019-03-13 10:58:39 -07:00
Tom Caputi
369aa501d1 Fix handling of maxblkid for raw sends
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.

For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8168 
Closes #8487
2019-03-13 10:52:01 -07:00
Justin Gottula
cffa8372f4 Fix most zfs_arc_* mod params not actually being modifiable at runtime
Most of the zfs_arc_* module parameters do not have their values used by
the ARC code directly. Instead, there is a function, arc_tuning_update,
which is called during module initialization and periodically
thereafter, whose job is to fetch the module parameter values, clamp/
limit them appropriately, and then assign those values to a separate set
of internal variables that are actually referenced by the ARC code.

Commit 3ec34e55 featured an overhaul of arc_reclaim_thread, which is the
former location where the post-init-time calls to arc_tuning_update
would occur. The rework split the work previously done by the
arc_reclaim_thread into a pair of replacement threads; and
unfortunately, the call to arc_tuning_update fell through the cracks and
was lost in the reorganization.

This meant that changing almost any ARC-related zfs module parameter via
/sys/module/zfs/parameters/ would result in the module parameter value
itself appearing to change; however the modification would not actually
propagate to the ARC code and have any real effect.

This commit reinstates the post-init-time call to arc_tuning_update. It
is now called during arc_adjust_cb_check; this should be equivalent to
its former call location in arc_reclaim_thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #8405 
Closes #8463
2019-03-12 15:03:59 -07:00
Alek P
4c0883fb4a Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
2019-03-12 13:13:22 -07:00
Brian Behlendorf
dd785b5b86
Fix vdev_initialize_restart / removal race
Resolve a vdev_initialize crash uncovered by ztest.  Similar
to when starting a new initialization verify that a removal
is not in progress.  Additionally, do not restart when the
thread already exists.  This check is now congruent with the
POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl().

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8477
2019-03-12 10:39:47 -07:00
Olaf Faaland
3d31aad83e MMP writes rotate over leaves
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes.  This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.

The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.

For example, in a pool like this:

       root-vdev
   +------+---------------+
vdev1                   vdev2
  |                       |
  |                +------+-----+-----+----+
disk1             disk2 disk3 disk4 disk5 disk6

vdev1 and vdev2 will each be chosen 50% of the time.  Every time vdev1
is chosen, disk1 will be chosen.  However, every time vdev2 is chosen,
disk2 is chosen 20% of the time.  As a result, disk1 will be sent 5x as
many MMP writes as disk2.

This may create wear issues in the case of SSDs.  It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.

The new code maintains a list of leaf vdevs in the pool.  MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf.  To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached.  It stops when a
suitable leaf is found or all leaves have been examined.

Added a test to verify MMP write distribution is even.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7953
2019-03-12 10:37:06 -07:00
George Wilson
b1b94e9644 zfs does not honor NFS sync write semantics
The linux kernel's nfsd implementation use RWF_SYNC to determine if the
write is synchronous or not. This flag is used to set the kernel's I/O
control block flags. Unfortunately, ZFS was not updated to inspect these
flags so NFS sync writes were not being honored.

This change maps the IOCB_* flags to the ZFS equivalent.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8474 
Closes #8452 
Closes #8486
2019-03-11 09:13:37 -07:00
mzhivich
1118f99449 Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen()
Booting debug kernel found an inconsistent lock dependency between
dataset's ds_lock and its directory's dd_lock.

[ 32.215336] ======================================================
[ 32.221859] WARNING: possible circular locking dependency detected
[ 32.221861] 4.14.90+ #8 Tainted: G           O
[ 32.221862] ------------------------------------------------------
[ 32.221863] dynamic_kernel_/4667 is trying to acquire lock:
[ 32.221864]  (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.221941] but task is already holding lock:
[ 32.221941]  (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]
[ 32.221983] which lock already depends on the new lock.
[ 32.221983] the existing dependency chain (in reverse order) is:
[ 32.221984] -> #1 (&dd->dd_lock){+.+.}:
[ 32.221992] 	__mutex_lock+0xef/0x14c0
[ 32.222049] 	dsl_dir_namelen+0xd4/0x2d0 [zfs]
[ 32.222093] 	dsl_dataset_namelen+0x2f1/0x430 [zfs]
[ 32.222142] 	verify_dataset_name_len+0xd/0x40 [zfs]
[ 32.222184] 	dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs]
[ 32.222226] 	dmu_objset_find_dp_cb+0x40/0x60 [zfs]
[ 32.222235] 	taskq_thread+0x969/0x1460 [spl]
[ 32.222238] 	kthread+0x2fb/0x400
[ 32.222241] 	ret_from_fork+0x3a/0x50

[ 32.222241] -> #0 (&ds->ds_lock){+.+.}:
[ 32.222246] 	lock_acquire+0x14f/0x390
[ 32.222248] 	__mutex_lock+0xef/0x14c0
[ 32.222291] 	dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.222355] 	dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs]
[ 32.222392] 	dmu_tx_assign+0xa61/0xdb0 [zfs]
[ 32.222436] 	zfs_create+0x4e6/0x11d0 [zfs]
[ 32.222481] 	zpl_create+0x194/0x340 [zfs]
[ 32.222484] 	lookup_open+0xa86/0x16f0
[ 32.222486] 	path_openat+0xe56/0x2490
[ 32.222488] 	do_filp_open+0x17f/0x260
[ 32.222490] 	do_sys_open+0x195/0x310
[ 32.222491] 	SyS_open+0xbf/0xf0
[ 32.222494] 	do_syscall_64+0x191/0x4f0
[ 32.222496] 	entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 32.222497] other info that might help us debug this:

[ 32.222497] Possible unsafe locking scenario:
[ 32.222498] CPU0 			CPU1
[ 32.222498] ---- 			----
[ 32.222499] lock(&dd->dd_lock);
[ 32.222500] 				lock(&ds->ds_lock);
[ 32.222502] 				lock(&dd->dd_lock);
[ 32.222503] lock(&ds->ds_lock);
[ 32.222504] *** DEADLOCK ***
[ 32.222505] 3 locks held by dynamic_kernel_/4667:
[ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0
[ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490
[ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]

The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by
acquiring dd_lock on ds->ds_dir in dsl_dir_namelen().

However, ds->ds_dir should not be protected by ds_lock, so releasing it before
call to dsl_dir_namelen() prevents the lockdep issue

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by:  Michael Zhivich <mzhivich@akamai.com>
Closes #8413
2019-03-11 09:11:04 -07:00
Brian Behlendorf
b46fd243d5
Linux 5.1 compat: get_ds() removed
Commit torvalds/linux@736706bee has removed the get_fs() function
as a bit of cleanup.  It has been defined as KERNEL_DS on all
architectures for all supported kernels.  Replace get_fs() with
KERNEL_DS as was done in the kernel.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8479
2019-03-07 14:44:23 -08:00
Paul Zuchowski
a73e8fdb93 Stack overflow in recursive bpobj_iterate_impl
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7674
Closes #7675 
Closes #7908
2019-03-06 09:50:55 -08:00
Brian Behlendorf
96ebc5a1a4
Fix race in vdev_initialize_thread
Before allowing new allocations to the metaslab we need to ensure
that any issued initializing writes have been synced.  Otherwise,
it's possible for metaslab_block_alloc() to allocate a range which
is about to be overwritten by an initializing IO.

Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8461
2019-03-06 09:17:53 -08:00
Matthew Ahrens
0409679d88 Fix style of spl_kmem_cache_create()
Fix indentation of code in ifdef's.
Remove obsolete comment.
Make if/else statements more readable by adding braces.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8459
2019-02-28 17:57:47 -08:00
Olaf Faaland
8133679ff0 Do not resume a pool if multihost is enabled
When multihost is enabled, and a pool is suspended, return
EINVAL in response to "zpool clear <pool>".  The pool
may have been imported on another host while I/O was suspended.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6933 
Closes #8460
2019-02-28 17:56:19 -08:00
Matthew Ahrens
87c25d567f abd_alloc should use scatter for >1K allocations
abd_alloc() normally does scatter allocations, thus solving the problem
that ABD originally set out to: the bulk of ZFS's allocations are single
pages, which are faster to allocate and free, and don't suffer from
internal fragmentation (and the inability to reclaim memory because some
buffers in the slab are still allocated).

However, the current code does linear allocations for 4KB and smaller
allocations, defeating the purpose of ABD.

Scatter ABD's use at least one page each, so sub-page allocations waste
some space when allocated as scatter (e.g. 2KB scatter allocation wastes
half of each page).  Using linear ABD's for small allocations means that
they will be put on slabs which contain many allocations.  This can
improve memory efficiency, but it also makes it much harder for ARC
evictions to actually free pages, because all the buffers on one slab
need to be freed in order for the slab (and underlying pages) to be
freed.  Typically, 512B and 1KB kmem caches have 16 buffers per slab, so
it's possible for them to actually waste more memory than scatter (one
page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting
15/16th).

Spill blocks are typically 512B and are heavily used on systems running
selinux with the default dnode size and the `xattr=sa` property set.

By default we will use linear allocations for 512B and 1KB, and scatter
allocations for larger (1.5KB and up).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by:  DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8455
2019-02-28 17:52:55 -08:00
Brian Behlendorf
6af7ba417e
Fix overly broad spa config lock
The spa_txg_history_init_io() and spa_txg_history_fini_io() were
mistakenly taking SCL_ALL when only SCL_CONFIG is required to
access the vdev stats.  This could result in a deadlock which
was observed when running ztest.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8445
2019-02-27 10:49:22 -08:00
Serapheim Dimitropoulos
8eef997679 Error path in metaslab_load_impl() forgets to drop ms_sync_lock
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8444
2019-02-25 11:08:52 -08:00
loli10K
c44a3ec059 zvol: allow rename of in use ZVOL dataset
While ZFS allow renaming of in use ZVOLs at the DSL level without issues
the ZVOL layer does not correctly update the renamed dataset if the
device node is open (zv->zv_open_count > 0): trying to access the stale
dataset name, for instance during a zfs receive, will cause the
following failure:

VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00)
PANIC at zvol.c:1255:zvol_resume()
Showing stack for process 1390
CPU: 0 PID: 1390 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30
 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40
 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? rrw_exit+0xc8/0x2e0 [zfs]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs]
 [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs]
 [<0>] ? zvol_resume+0x178/0x280 [zfs]
 [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs]
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs]
 [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs]
 [<0>] ? __alloc_pages_nodemask+0x166/0xb50
 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs]
 [<0>] ? handle_mm_fault+0x464/0x1140
 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<0>] ? __do_page_fault+0x177/0x410
 [<0>] ? SyS_ioctl+0x81/0xa0
 [<0>] ? async_page_fault+0x28/0x30
 [<0>] ? system_call_fast_compare_end+0x10/0x15

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6263 
Closes #8371
2019-02-22 15:38:42 -08:00
loli10K
0c637f3100 zpool reports 16E expandsize on disks with oddball number of sectors
The issue is caused by a small discrepancy in how userland creates the
partition layout and the kernel estimates available space:

 * zpool command: subtract 9M from the usable device size, then align
   to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
   EFI "reserved" partition.

 * kernel module: subtract 10M from the device size. 10M is the sum of
   1M "start" partition alignment + 1m "end" partition alignment + 8M
   EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module vdev_open() the device its max_asize ends up
being slightly smaller than asize: this results in a huge number (16E)
reported by metaslab_class_expandable_space().

This change prevents bdev_max_capacity() from returing a size smaller
than bdev_capacity().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1468 
Closes #8391
2019-02-22 15:36:34 -08:00
lidongyang
8d9e51c084 Fix dnode_hold_impl() soft lockup
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.

Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes #8433
2019-02-22 09:48:37 -08:00
Tomohiro Kusumi
9abbee4912 Don't enter zvol's rangelock for read bio with size 0
The SCST driver (SCSI target driver implementation) and possibly
others may issue read bio's with a length of zero bytes. Although
this is unusual, such bio's issued under certain condition can cause
kernel oops, due to how rangelock is implemented.

rangelock_add_reader() is not made to handle overlap of two (or more)
ranges from read bio's with the same offset when one of them has size
of 0, even though they conceptually overlap. Allowing them to enter
rangelock results in kernel oops by dereferencing invalid pointer,
or assertion failure on AVL tree manipulation with debug enabled
kernel module.

For example, this happens when read bio whose (offset, size) is
(0, 0) enters rangelock followed by another read bio with (0, 4096)
when (0, 0) rangelock is still locked, when there are no pending
write bio's. It can also happen with reverse order, which is (0, N)
followed by (0, 0) when (0, N) is still locked. More details
mentioned in #8379.

Kernel Oops on ->make_request_fn() of ZFS volume
https://github.com/zfsonlinux/zfs/issues/8379

Prevent this by returning bio with size 0 as success without entering
rangelock. This has been done for write bio after checking flusher
bio case (though not for the same reason), but not for read bio.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8379 
Closes #8401
2019-02-20 10:14:36 -08:00
Serapheim Dimitropoulos
928e8ad47d Introduce auxiliary metaslab histograms
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram.  The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.

In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8358
2019-02-20 09:59:56 -08:00
loli10K
bb1be77a35 Prevent user accounting on readonly pool
Trying to mount a dataset from a readonly pool could inadvertently start
the user accounting upgrade task, leading to the following failure:

VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
 [<0>] ? dmu_object_next+0x77/0x130 [zfs]
 [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
 [<0>] ? txg_wait_synced+0x91/0x220 [zfs]
 [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
 [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
 [<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
 [<0>] ? wake_up_state+0x10/0x10
 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
 [<0>] ? kthread+0xbd/0xe0
 [<0>] ? kthread_create_on_node+0x180/0x180
 [<0>] ? ret_from_fork+0x58/0x90
 [<0>] ? kthread_create_on_node+0x180/0x180

This patch updates both functions responsible for checking if we can
perform user accounting to verify the pool is not readonly.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8424
2019-02-19 18:41:18 -08:00
Sara Hartse
f545b6ae00 Delay injection can cause indefinitely hung zios
If we hit the (NSEC_TO_TICK(diff) == 0) condition in
zio_delay_interrupt, zio_interrupt is never called and the
zio does not progress.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8404
2019-02-15 14:44:56 -08:00
Tim Chase
638dd5f44e zio_deadman_impl() fix and enhancement
Add the zio_deadman_log_all tunable to print all zios in
zio_deadman_impl().  Also, in all cases, display the depth of the
zio relative to the original parent zio.  This is meant to be used by
developers to gain diagnostic information for hangs which don't involve
fully set-up zio trees or are otherwise stuck or hung in an early stage.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8362
2019-02-15 12:44:24 -08:00
Paul Zuchowski
9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Tomohiro Kusumi
2d76ab9e42 Fix obsolete comment on rangelock
5d43cc9a59 renamed it to rangelock_enter().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8408
2019-02-14 15:13:58 -08:00
Alek P
65282ee9e0 Freeing throttle should account for holes
Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7725 
Closes #7888
2019-02-12 12:01:08 -08:00
Alek P
dcec0a12c8 port async unlinked drain from illumos-nexenta
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8142
2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos
425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00
loli10K
d8d418ff0c ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
2019-02-08 15:44:15 -08:00
loli10K
4417096956 Pool allocation classes misplacing small file blocks
Due to an off-by-one condition in spa_preferred_class() we are picking
the "normal" allocation class instead of the "special" one for file
blocks with size equal to the special_small_blocks property value.

This change fix the small code issue, update the ZFS Test Suite and the
zfs(8) man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8351
Closes #8361
2019-02-08 12:32:12 -08:00
Tim Chase
0902c4577f Fix ARC stats for embedded blkptrs
Re-factor arc_read() to better account for embedded data blkptrs.
Previously, reading the payload from an embedded blkptr would cause
arcstats such as demand_metadata_misses to be bumped when there was
actually no cache "miss" because the data are already available in
the blkptr.

The following test procedure was used to demonstrate the problem:

   zpool create tank ...
   zfs create -o compression=lz4 tank/fs
   echo blah > /tank/fs/blah
   stat /tank/fs/blah
   grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats

and repeating the last two steps to watch the metadata miss counter
increment.  This can also be demonstrated via the  zfs_arc_miss DTRACE4
probe in arc_read().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8319
2019-02-04 09:33:30 -08:00
Serapheim Dimitropoulos
6c926f426a Simplify log vdev removal code
Get rid of the majority metaslab metadata when removing log vdevs
in spa_vdev_remove_log() with a call to metaslab_fini() instead
of duplicating a lot of that in vdev_remove_empty_log().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8347
2019-01-31 09:17:52 -08:00
Serapheim Dimitropoulos
7558997d2f vs_alloc can underflow in L2ARC vdevs
The current L2 ARC device code consistently uses psize to
increment vs_alloc but varies between psize and lsize when
decrementing it. The result of this behavior is that
vs_alloc can be decremented more that it is incremented
and underflow. This patch changes the code so asize is
used anywhere.

In addition, it ensures that vs_alloc gets incremented by
the L2 ARC device code as buffers are written and not at
the end of the l2arc_write_buffers() routine. The latter
(and old) way would temporarily underflow vs_alloc as
buffers that were just written, would be destroyed while
l2arc_write_buffers() was still looping.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8298
2019-01-31 09:16:39 -08:00
Sara Hartse
2747f599ff Don't acquire zthr_request_lock in zthr_wakeup
Address a deadlock caused by simultaneous wakeup and cancel on a zthr
by remove the hold of zthr_request_lock from zthr_wakeup. This
allows thr_wakeup to not block a thread that is in the process of
being cancelled.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8333
2019-01-30 12:31:16 -08:00
Brian Behlendorf
26a856594f Linux 5.0 compat: Fix bio_set_dev()
The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the
GPL-only bio_associate_blkg() symbol thus inadvertently converting
the entire macro.  Provide a minimal version which always assigns the
request queue's root_blkg to the bio.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8287
2019-01-28 10:11:45 -08:00
Tony Hutter
ed158b19b1 Linux 5.0 compat: Fix SUBDIRs
SUBDIRs has been deprecated for a long time, and was finally removed in
the 5.0 kernel.  Use "M=" instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8257
2019-01-28 10:11:45 -08:00
Tony Hutter
05805494dd Linux 5.0 compat: Convert MS_* macros to SB_*
In the 5.0 kernel, only the mount namespace code should use the MS_*
macos. Filesystems should use the SB_* ones.

https://patchwork.kernel.org/patch/10552493/

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8264
2019-01-28 10:11:39 -08:00
Tony Hutter
031cea17a3 Linux 5.0 compat: Use totalram_pages()
totalram_pages() was converted to an atomic variable in 5.0:

https://patchwork.kernel.org/patch/10652795/

Its value should now be read though the totalram_pages() helper
function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8263
2019-01-28 10:11:14 -08:00
Tony Hutter
77e50c3070 Linux 5.0 compat: access_ok() drops 'type' parameter
access_ok no longer needs a 'type' parameter in the 5.0 kernel.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8261
2019-01-28 10:11:10 -08:00
Serapheim Dimitropoulos
c853f382db Change target size of metaslabs from 256GB to 16GB
= Old behavior

For vdev sizes 100GB to 50TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 256GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= New Behavior

For vdev sizes 100GB to 3TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 16GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= Reasoning

The old behavior makes metaslabs grow in size when
the vdev range is between 3TB (ms_size 16GB) and
32PB (ms_size 256GB). Even though keeping the number
of metaslabs is good in terms of potential number of
I/Os per TXG, these bigger metaslabs take longer
to be loaded and after they are loaded they can
take up a lot of memory because of their range trees.

This change tries to put a boundary in memory and
loading time for the specific range of vdev sizes.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8324
2019-01-25 16:38:27 -08:00
Serapheim Dimitropoulos
df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8327
2019-01-25 09:51:24 -08:00
Tim Chase
107dd2b174 Use proper tag for spa config refcounts in mmp_write_uberblock()
This allows the spa config refcounts to use tracking in debug builds
without triggering the "No such hold %p on refcount" panic.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8326
2019-01-25 09:50:06 -08:00
Tom Caputi
b5d693581d Fix bad kmem_free() in zvol_rename_minors_impl()
Currently, zvol_rename_minors_impl() calls kmem_asprintf()
to allocate and initialize a string. This function is a thin
wrapper around the kernel's kvasprintf() and does not call
into the SPL's kmem tracking code when it is enabled. However,
this function frees the string with the tracked kmem_free()
instead of the untracked strfree(), which causes the SPL
kmem tracking code to believe that the function is attempting
to free memory it never allocated, triggering an ASSERT. This
patch simply corrects this issue.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8307
2019-01-23 11:38:05 -08:00
loli10K
0a10863194 ztest: creates partially initialized root dataset
Since d8fdfc2 was integrated dsl_pool_create() does not call
dmu_objset_create_impl() for the root dataset when running in
userland (ztest): this creates a pool with a partially initialized
root dataset. Trying to import and use this pool results in both
zpool and zfs executables dumping core.

Fix this by adopting an alternative change suggested in OpenZFS 8607
code review.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Original-patch-by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8277
2019-01-18 11:14:01 -08:00
Brian Behlendorf
ad63507135
Remove zfs_sync() panicking kernel check
This check provides no real additional protection and unnecessarily
introduces a dependency on the "oops_in_progress" kernel symbol.
Remove the check, it there are special circumstances on other
platforms which make this a requirement it can be reintroduced
for all relevant call paths in a more portable comprehensive manor.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8297
2019-01-18 11:11:47 -08:00
Serapheim Dimitropoulos
b194fab0fb Factor metaslab_load_wait() in metaslab_load()
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.

Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8290
2019-01-18 11:10:32 -08:00
Tom Caputi
960347d3a6 Fix 0 byte memory leak in zfs receive
Currently, when a DRR_OBJECT record is read into memory in
receive_read_record(), memory is allocated for the bonus buffer.
However, if the object doesn't have a bonus buffer the code will
still "allocate" the zero bytes, but the memory will not be passed
to the processing thread for cleanup later. This causes the spl
kmem tracking code to report a leak. This patch simply changes the
code so that it only allocates this memory if it has a non-zero
length.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8266
2019-01-18 11:06:48 -08:00
loli10K
60b0a963f5 Off-by-one in zap_leaf_array_create()
Trying to set user properties with their length 1 byte shorter than the
maximum size triggers an assertion failure in zap_leaf_array_create():

  panic[cpu0]/thread=ffffff000a092c40:
  assertion failed: num_integers * integer_size < (8<<10) (0x2000 < 0x2000), file: ../../common/fs/zfs/zap_leaf.c, line: 233

  ffffff000a092500 genunix:process_type+167c35 ()
  ffffff000a0925a0 zfs:zap_leaf_array_create+1d2 ()
  ffffff000a092650 zfs:zap_entry_create+1be ()
  ffffff000a092720 zfs:fzap_update+ed ()
  ffffff000a0927d0 zfs:zap_update+1a5 ()
  ffffff000a0928d0 zfs:dsl_prop_set_sync_impl+5c6 ()
  ffffff000a092970 zfs:dsl_props_set_sync_impl+fc ()
  ffffff000a0929b0 zfs:dsl_props_set_sync+79 ()
  ffffff000a0929f0 zfs:dsl_sync_task_sync+10a ()
  ffffff000a092a80 zfs:dsl_pool_sync+3a3 ()
  ffffff000a092b50 zfs:spa_sync+4e6 ()
  ffffff000a092c20 zfs:txg_sync_thread+297 ()
  ffffff000a092c30 unix:thread_start+8 ()

This patch simply corrects the assertion.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8278
2019-01-18 09:58:46 -08:00
Serapheim Dimitropoulos
8dc2197b7b Simplify spa_sync by breaking it up to smaller functions
The point of this refactoring is to break the high-level conceptual 
steps of spa_sync() to their own helper functions. In general large 
functions can enhance readability if structured well, but in this
case the amount of conceptual steps taken could use the help of 
helper functions.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8293
2019-01-18 09:50:16 -08:00
Tom Caputi
305781da4b Fix error handling incallers of dbuf_hold_level()
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8291
2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos
75058f3303 Remove unused vdev_t fields
The following fields from the vdev_t struct are not used anywhere.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8285
2019-01-17 15:41:12 -08:00
Brian Behlendorf
52b684236d
ztest: scrub ddt repair
The ztest_ddt_repair() test is designed inflict damage to the
ddt which can be repairable by a scrub.  Unfortunately, this
repair logic was broken at some point and it went undetected.
This issue is not specific to ztest, but thankfully this extra
redundancy is rarely enabled and even more rarely needed.

The root cause was identified to be the ddt_bp_create()
function called by dsl_scan_ddt_entry() which did not set the
dedup bit of the generated block pointer.

The consequence of this was that the ZIO_DDT_READ_PIPELINE was
never enabled for the block pointer during the scrub, and the
dedup ditto repair logic was never run.  Note that for demand
reads which don't rely on ddt_bp_create() the required pipeline
stages would be enabled and the repair performed.

This was resolved by unconditionally setting the dedup bit in
ddt_bp_create().  This way all codes paths which may need to
perform a repair from a block pointer generated from the dtt
entry will be able too.  The only exception is that the dedup
bit is cleared in ddt_phys_free() which is required to avoid
leaking space.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8270
2019-01-17 15:25:00 -08:00
Serapheim Dimitropoulos
419ba59145 Update vdev_is_spacemap_addressable() for new spacemap encoding
Since the new spacemap encoding was ported to ZoL that's no longer 
a limitation. This patch updates vdev_is_spacemap_addressable() 
that was performing that check.

It also updates the appropriate test to ensure that the same 
functionality is tested.  The test does so by creating pools that 
don't have the new spacemap encoding enabled - just the checkpoint
feature. This patch also reorganizes that same tests in order to 
cut in half its memory consumption.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8286
2019-01-16 15:06:20 -08:00
Brian Behlendorf
64bdf63f5c
ztest: split block reconstruction
Increase the default allowed number of reconstruction attempts.
There's not an exact right number for this setting.  It needs
to be set large enough to cover any realistic failure scenarios
and small enough to avoid stalling the IO pipeline and invoking
the dead man detection.

The current value of 256 was empirically determined to be too
low based on multi-day runs of ztest.  The fault injection code
would inject more damage than could be reconstructed given the
relatively small number of attempts.  However, in all observed
cases the block could be reconstructed using a slightly higher
limit.

Based on local testing increasing the default value to 4096 was
determined to strike the best balance.  Checking all combinations
takes less than 10s in the worst case, and has so far eliminated
the vast majority of false positives detected by ztest.  This
delay is roughly on par with how long retries may be performed
to a misbehaving HDD and was deemed to be reasonable.  Better to
err on the side of a brief delay rather than fail to reconstruct
the data.

Lastly, the -Y flag has been added to zdb to make it easy to try all
possible combinations when performing split block reconstruction.
For badly damaged blocks with 18 splits, they can be fully enumerated
within a few minutes.  This has been done to ensure permanent errors
are never incorrectly reported when ztest verifies the pool with zdb.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8271
2019-01-16 14:10:02 -08:00
Tom Caputi
5e7f3ace58 Fix zio leak in dbuf_read()
Currently, dbuf_read() may decide to create a zio_root which is
used as a parent for any child zios created in dbuf_read_impl().
However, if there is an error in dbuf_read_impl(), this zio is
never executed and ends up leaked. This patch simply ensures
that we always execute the root zio, even i it has no real work
to do.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8267
2019-01-15 12:23:40 -08:00
Brian Behlendorf
d611989fdc
Minor spelling corrections
Some minor spelling mistakes and typos.  No functional changes.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8272
2019-01-13 10:11:52 -08:00
Serapheim Dimitropoulos
61c3391acc Serialize ZTHR operations to eliminate races
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8229
2019-01-13 10:09:46 -08:00
Paul Zuchowski
83c796c5e9 zfs filesystem skipped by df -h
On full pool when pool root filesystem references very few bytes,
the f_blocks returned to statvfs is 0 but should be at least 1.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8253 
Closes #8254
2019-01-13 10:06:13 -08:00
Brian Behlendorf
6955b40138
Provide more flexible object allocation interface
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.

The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.

The zap_create_* functions have been updated to take advantage of
this new functionality.  The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().

The new symbols have been exported for use by Lustre.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8015
2019-01-10 14:37:43 -08:00
Tom Caputi
58769a4ebd Don't allow dnode allocation if dn_holds != 0
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8249
2019-01-10 14:36:23 -08:00
loli10K
0f5f23869a zfs receive and rollback can skew filesystem_count
This commit fixes a small issue which causes both zfs receive and
rollback operations to incorrectly increase the "filesystem_count"
property value.

This change also adds a new test group "limits" to the ZFS Test Suite
to exercise both filesystem_count/limit and snapshot_count/limit
functionality.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8232
2019-01-08 10:17:46 -08:00
asomers
f384c045d8 OpenZFS 8473 - scrub does not detect errors on active spares
Scrubbing is supposed to detect and repair all errors in the pool.
However, it wrongly ignores active spare devices. The problem can
easily be reproduced in OpenZFS at git rev 0ef125d with these
commands:

    truncate -s 64m /tmp/a /tmp/b /tmp/c
    sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
    sudo zpool replace testpool /tmp/a /tmp/c
    /bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
    sync
    sudo zpool scrub testpool
    zpool status testpool # Will show 0 errors, which is wrong
    sudo zpool offline testpool /tmp/a
    sudo zpool scrub testpool
    zpool status testpool # Will show errors on /tmp/c,
                          # which should've already been fixed

FreeBSD head is partially affected: the first scrub will detect
some errors, but the second scrub will detect more.  This same
test was run on Linux before applying the fix and the FreeBSD
head behavior was observed.

Authored by: asomers <asomers@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored by: Spectra Logic Corp

OpenZFS-issue: https://www.illumos.org/issues/8473
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/e20ec8879
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/554675ee
Closes #8251
2019-01-08 09:51:30 -08:00
George Wilson
c10d37dd9f zfs initialize performance enhancements
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8230
2019-01-07 11:03:08 -08:00
George Wilson
619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
Brian Behlendorf
0b8e4418b6
Add zfs module feature and property compatibility
This change is required to ease the transition when upgrading
from 0.7.x to 0.8.x.  It allows 0.8.x user space utilities to
remain compatible with 0.7.x and older kernel modules.

Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8231
2019-01-03 15:19:37 -08:00
Brian Behlendorf
65ca2c1eb9
Fix 'zpool remap' freeing race
The dmu_objset_remap_indirects_impl() logic depends on dnode_hold()
returning ENOENT for dnodes which will be freed and should be skipped.

This behavior can only be relied upon when taking a new hold and
while the caller has an open transaction.  This ensures that the
open txg cannot advance and that a concurrent free will end up
in the same txg (which is critical).  Relying on an existing hold
will not prevent dnode_free() from succeeding.

The solution is to take an additional dnode_hold() after assigning
the transaction.  This ensures the remap will never dirty the dnode
if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT).

Randomly set zfs_object_remap_one_indirect_delay_ms in ztest.  This
increases the likelihood of an operation racing with the remap.
Converted from ticks to milliseconds.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8215
2019-01-02 11:46:04 -08:00
Brad Lewis
3ec34e5527 OpenZFS 9284 - arc_reclaim_thread has 2 jobs
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping.  However, the code is still confusing and error-prone,
because this thread has two responsibilities.  We should instead
separate this into two threads each with their own responsibility:

 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

 2. keep enough free memory in the system, by calling
    `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
    `arc_available_memory()`.

Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by:  Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
2018-12-26 13:22:28 -08:00
Tom Caputi
00f198de6b Fix zfs_dirty_data_sync_percent documentation
In dfbe2675 zfs_dirty_data_sync was changed to a new tunable named
zfs_dirty_data_sync_percent. Unfortunately, the module parameter
documentation is the code was not updated accordingly. This patch
simply corrects that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8212
2018-12-18 14:47:33 -08:00
Tom Caputi
2a6078450d Fix zap_update() ASSERT from ztest
This patch simply removes an invalid assert from the zap_update()
function. The ASSERT is invalid because it does not hold the zap
lock from the time it fetches the old value to the time it confirms
that it is what it should be.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8209
2018-12-14 10:04:11 -08:00
Andriy Gapon
dc1c630b8a OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_core
Porting Notes:
* Additional changes to recv_rename_impl() were required due to
  encryption code not being merged in OpenZFS yet.
* libzfs_core python bindings (pyzfs) were updated to fully support
  both lzc_rename() and lzc_destroy()

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: loli10K <ezomori.nozomu@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9630
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63
Closes #8207
2018-12-14 09:49:45 -08:00
Tom Caputi
5aa95ba0d3 Fix resilver writes in vdev_indirect_io_start
This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
2018-12-13 14:18:48 -08:00
Richard Elling
a48cd034c8 Seeing negative values for wlentime and rlentime
Linux kstat IO and TIMER printed values as signed. However the counters
only increment. Thus humans looking at the data can be confused when
the counters roll over.

Note: The recommended use of these values is to monitor the derivative,
which don't really care about the sign. See explanations related to
non-negative derivatives in the various time-series databases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8131 
Closes #8198
2018-12-11 13:56:54 -08:00
Olaf Faaland
fa61e72340 Rename macro ZFS_MINOR due to Lustre conflict
Macro ZFS_MINOR, introduced in commit a6cc9756 to record the chosen
static minor number for /dev/zfs, conflicts with an existing macro
in Lustre.  The lustre macro (along with _MAJOR, _PATCH, _FIX) is
used to record the zfsonlinux version Lustre is being built against.

Since the Lustre macro came first, and is used in past versions of
lustre at least going back to 2.10, it makes sense to rename the
macro in ZFS instead of doing so in Lustre which would require
backporting the patch.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8195
2018-12-10 16:52:49 -08:00
Prakash Surya
900d09b285 OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.

Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).

Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.

The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).

For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.

Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.

This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).

Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).

Porting Notes:
* Reconciled conflicts with the fastwrite feature.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-12-07 11:09:42 -08:00
Prakash Surya
53b1f5eac6 OpenZFS 9963 - Separate tunable for disabling ZIL vdev flush
Porting Notes:
* Add options to zfs-module-parameters(5) man page.
* zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since
  the latter doesn't get built for user space.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9963
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125
Closes #8186
2018-12-07 11:06:29 -08:00
George Wilson
18b14b17c8 OpenZFS 9993 - zil writes can get delayed in zio pipeline
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9993
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b
Closes #8185
2018-12-07 11:05:35 -08:00
Tom Caputi
d6496040d9 Ensure dsl scan prefetch queue is emptied
This patch simply ensures that scn->scn_prefetch_queue is emptied
before the kernel module is unloaded and when scanning completes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8178
2018-12-06 09:47:23 -08:00
Brian Behlendorf
78e2139467
Fix dnode_hold() freeing dnode behavior
Commit 4c5b89f59 refactored dnode_hold() and in the process
accidentally introduced a slight change in behavior which was
not intended.  The required behavior is that once the ZPL,
or other consumer, declares its intent to free a dnode then
dnode_hold() should immediately start failing.  This updated
code wouldn't return the failure until after it was freed.

When DNODE_MUST_BE_ALLOCATED is set it must return ENOENT, and
when DNODE_MUST_BE_FREE is set it must return EEXIST;

This issue was uncovered by ztest_remap() which attempted
to remap a freeing object which should have been skipped as
described by the comment in dmu_objset_remap_indirects_impl().

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8172
2018-12-05 09:29:33 -08:00
Brian Behlendorf
c5eea0ab9c
Fix 'zpool list -v' alignment
The verbose output of 'zpool list' was not correctly aligned due
to differences in the vdev name lengths.  Minimally update the
code the correct the alignment using the same strategy employed
by 'zpool status'.

Missing dashes were added for the empty defaults columns, and
the vdev state is now printed for all vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7308 
Closes #8147
2018-12-04 10:17:54 -08:00
Tom Caputi
fedef6dd59 Fix ztest deadlock in spa_vdev_remove()
This patch corrects an issue where spa_vdev_remove() would
call spa_history_log_internal() while holding the spa config
lock. This function may decide to block until the next txg if
the current one seems too full. However, since the thread is
holding the config log, the txg sync thread cannot progress
and the system ends up deadlocked. This patch simply moves
all calls to spa_history_log_internal() outside of the config
lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8162
2018-12-04 09:54:05 -08:00
Brian Behlendorf
7c9a42921e
Detect IO errors during device removal
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'.  Which
in this case will result in nothing being copied since one vdev
in each of those groups in unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal if an unexpected hard IO error is encountered when
either reading or writing the child vdev the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #8161
2018-12-04 09:37:37 -08:00
Tom Caputi
c40a1124e1 Fix consistency of ztest_device_removal_active
ztest currently uses the boolean flag ztest_device_removal_active
to protect some tests that may not run successfully if they occur
at the same time as ztest_device_removal(). Unfortunately, in the
event that ztest is in the middle of a device removal when it
decides to issue a SIGKILL, the device removal will be
automatically restarted (without setting the flag) when the pool
is re-imported on the next run. This patch corrects this by
ensuring that any in-progress removals are completed before running
further tests after the re-import.

This patch also makes a few small changes to prevent race conditions
involving the creation and destruction of spa->spa_vdev_removal,
since this field is not protected by any locks. Some checks that
may run concurrently with setting / unsetting this field have been
updated to check spa->spa_removing_phys.sr_state instead. The most
significant change here is that spa_removal_get_stats() no longer
accounts for in-flight work done, since that could result in a NULL
pointer dereference.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8105
2018-11-28 20:47:09 -08:00
LOLi
c71c8c715b zfs_dbgmsg() is not safe from every context
This commit reverts to using printk() instead of zfs_dbgmsg() to log
messages in vdev_disk_error(): this is necessary because the latter can
be called from interrupt context where we are not allowed to sleep.
Unfortunately zfs_dbgmsg() performs its allocations calling kmalloc()
with the KM_SLEEP flag which may result in the following oops:

   BUG: scheduling while atomic: swapper/4/0/0x10000100
	Call Trace:
	<IRQ>  [<0>] dump_stack+0x19/0x1b
	...
	[<0>] spl_kmem_alloc+0xdf/0x140 [spl] <-- kmem_alloc(size, KM_SLEEP)
	[<0>] __dprintf+0x69/0x150 [zfs]
	[<0>] ? kmem_cache_free+0x1e2/0x200
	[<0>] vdev_disk_error.part.15+0x5f/0x70 [zfs]
	[<0>] vdev_disk_io_flush_completion+0x48/0x70 [zfs]
	[<0>] bio_endio+0x67/0xb0
	[<0>] blk_update_request+0x90/0x360
	...
	[<0>] scsi_finish_command+0xdc/0x140
	[<0>] scsi_softirq_done+0x132/0x160
	[<0>] blk_done_softirq+0x96/0xc0
	[<0>] __do_softirq+0xf5/0x280
	[<0>] call_softirq+0x1c/0x30
	[<0>] do_softirq+0x65/0xa0
	[<0>] irq_exit+0x105/0x110
	[<0>] do_IRQ+0x56/0xf0
	[<0>] common_interrupt+0x162/0x162
	<EOI>  [<0>] ? cpuidle_enter_state+0x54/0xd0
	[<0>] cpuidle_idle_call+0xde/0x230
	[<0>] arch_cpu_idle+0xe/0xb0
	[<0>] cpu_startup_entry+0x14a/0x1e0
	[<0>] start_secondary+0x1f7/0x270
	[<0>] start_cpu+0x5/0x14

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8137 
Closes #8150
2018-11-28 11:29:57 -08:00
Tom Caputi
cef48f14da Remove races from scrub / resilver tests
Currently, several tests in the ZFS Test Suite that attempt to
test scrub and resilver behavior occasionally fail. A big reason
for this is that these tests use a combination of zinject and
zfs_scan_vdev_limit to attempt to slow these operations enough
to verify their test commands. This method works most of the time,
but provides no guarantees and leads to flaky behavior. This patch
adds a new tunable, zfs_scan_suspend_progress, that ensures that
scans make no progress, guaranteeing that tests can be run without
racing.

This patch also changes zfs_remove_max_bytes_pause to match this
new tunable. This provides some consistency between these two
similar tunables and ensures that the tunable will not misbehave
on 32-bit systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8111
2018-11-28 10:12:08 -08:00
LOLi
c8fd652ce7 Fix coverity defects: CID 184285
CID 184285: Read from pointer after free (USE_AFTER_FREE)

This patch fixes an use-after-free in vdev_config_generate_stats()
moving the kmem_free() call at the end of the function.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8120
2018-11-11 18:09:00 -08:00
loli10K
d48091de81 zed: detect and offline physically removed devices
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.

We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1537
Closes #7926
2018-11-09 11:17:24 -08:00
Tony Hutter
ad796b8a3b Add zpool status -s (slow I/Os) and -p (parseable)
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.

 	NAME         STATE     READ WRITE CKSUM  SLOW
 	testpool     ONLINE       0     0     0     -
	  mirror-0   ONLINE       0     0     0     -
 	    loop0    ONLINE       0     0     0    20
 	    loop1    ONLINE       0     0     0     0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7756
Closes #6885
2018-11-08 16:47:24 -08:00
George Melikov
877d925a9e Update zfs_admin_snapshot value (disabled)
It's disabled by default, update code and tests to reflect
the documentation.

Minor cleanup in delegate_common.kshlib.

Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #8045
2018-11-08 16:17:12 -08:00
Tom Caputi
20eb30d08e Fix divide by zero during indirect split damage
This patch simply ensures that vdev_indirect_splits_damage()
cannot hit a divide by zero exception if a split has no
children with valid data. The normal reconstruction code
path in vdev_indirect_reconstruct_io_done() already has this
check.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8086
2018-11-07 15:44:56 -08:00
Tom Caputi
fde25c0a87 Fix dirtying vdev config on with RO spa
This patch simply corrects an issue where vdev_dtl_reassess()
could attempt to dirty the vdev config even when the spa was
not elligable for writing.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8085
2018-11-07 15:44:14 -08:00
Tom Caputi
f44ad9297d Replay logs before starting ztest workers
This patch ensures that logs are replayed on all datasets prior
to starting ztest workers. This ensures that the call to
vdev_offline() a log device in ztest_fault_inject() will not fail
due to the log device being required for replay.

This patch also fixes a small issue found during testing where
spa_keystore_load_wkey() does not check that the dataset specified
is an encryption root. This check was present in libzfs, however.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8084
2018-11-07 15:40:24 -08:00
Tom Caputi
ac53e50f79 Fix vdev removal finishing race
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6900 
Closes #8083
2018-11-07 15:38:10 -08:00
Tom Caputi
4021ba4cfa Make vdev_set_deferred_resilver() recursive
vdev_clear() can call vdev_set_deferred_resilver() with a
non-leaf vdev to setup a deferred resilver. However, this
function is currently written to only handle leaf vdevs.
This bug was introduced with deferred resilvers in 80a91e74.
This patch makes this function recursive so that it can find
appropriate vdevs to resilver and set vdev_resilver_deferred
on them.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #7732
Closes #8082
2018-11-07 15:33:17 -08:00
Brian Behlendorf
09b85f2ded
ztest: reduce gangblock creation
In order to validate the gang block code ztest is configured to
artificially force a fraction of large blocks to be written as
gang blocks.  The default setting chosen for this was to
write 25% of all blocks 32k or larger using gang blocks.

The confluence of an unrealistically large number of gang blocks,
the aggressive fault injection done by ztest, and the split
segment reconstruction logic introduced by device removal has
resulted in the following type of failure:

  zdb -bccsv -G -d ... exit code 3

Specifically, zdb was unable to open the pool because it was
unable to reconstruct a damaged block.  Manual investigation
of multiple failures clearly showed that the block could be
reconstructed.  However, due to the large number of damaged
segments (>35) it could not be done in the allotted time.

Furthermore, the large number of gang blocks was determined
to be the reason for the unrealistically large number of
damaged segments.  In order to make this situation less
likely, this change both increases the forced gang block
size to 64k and reduces the frequency to 3% of blocks.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8080
2018-11-05 11:53:49 -08:00
Don Brady
e89f1295d4 Add libzutil for libzfs or libzpool consumers
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c).  This
removes the need for utilities to link against both libzpool and
libzfs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8050
2018-11-05 11:22:33 -08:00
Matthew Ahrens
9553c533a6 bpobj_enqueue_subobj() should copy small subobj's
When we delete a snapshot, we consolidate some bpobj's together because
we no longer need to keep their entries in separate buckets.  This is
done in constant time by including the "sub" bpobj by reference in the
parent bpobj.

After many snapshots have been deleted, we may have many sub-bpobj's.
Usually, most sub-bpobj's don't contain many BP's.  Compared to this
small payload, the sub-bpobj is relatively heavyweight since it is a
object in the MOS.  A common scenario on a long-lived pool is for the
vast majority of MOS objects to be small sub-bpobj's.

To improve this situation, when consolidating bpobj's together,
bpobj_enqueue_subobj() can copy the contents of small bpobj's into the
parent, and then delete the enqueued bpobj, rather than including it by
reference.  Since this copying is limited in size (to one block), the
consolidation is still constant time, though with a larger constant due
to reading in the one block of the enqueued bpobj.

This idea and mechanism are similar to how we handle "sub-subobj's".
When including a sub-bpobj by reference, if the sub-bpobj itself has
less than a block of sub-sub-bpobj's, the list of sub-sub-bpobj's is
copied to the parent bpobj's list of sub-bpobj's.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8053 
Issue #7908
2018-10-31 11:58:17 -05:00
Tom Caputi
8cb119e3dc Fix 2 small bugs with cached dsl_scan_phys_t
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:41 -07:00
Tom Caputi
7ab96299e5 Fix ENXIO from spa_ld_verify_logs() in ztest
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:33 -07:00
Tom Caputi
4a7eb69a5a Fix ztest deadman panic with indirect vdev damage
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artifically damaged
blocks would take too long to reconstruct. This patch simply
limits how often ztest inflicts split-block damage and how
many segments it can damage when it does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:31 -07:00
Tom Caputi
5e0bd0ae05 Fix issue with scanning dedup blocks as scan ends
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:15 -07:00
Tom Caputi
a783dd9684 Fix lock inversion in txg_sync_thread()
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:02 -07:00
Tom Caputi
ab4c009e3d Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:50 -07:00
Tom Caputi
c04812f964 Fix ASSERT in zil_create() during ztest
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:40 -07:00
Tom Caputi
9410257800 Fix random ztest_deadman_thread failures
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:21 -07:00
Paul Dagnelie
ae3d849142 OpenZFS 9688 - aggsum_fini leaks memory
Porting Notes:
- Most of these fixes were applied in the original 37fb3e43
  commit when this change was ported for Linux.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9688
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bf2d68be
Closes #8042
2018-10-19 12:08:03 -07:00
Serapheim Dimitropoulos
9b2266e3d8 OpenZFS 9682 - page fault in dsl_async_clone_destroy() while opening pool
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9682
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ade2c82828
Closes #8037
2018-10-19 12:06:21 -07:00
Serapheim Dimitropoulos
ee900344f2 OpenZFS 9690 - metaslab of vdev with no space maps was flushed during removal
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9690
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e75ba6826
Closes #8039
2018-10-19 12:05:03 -07:00
Matthew Ahrens
d637db98e1 OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename()
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9681
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7
Closes #8041
2018-10-19 12:02:28 -07:00
Tom Caputi
80a91e7469 Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.

This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee 
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
2018-10-18 21:06:18 -07:00
Matthew Ahrens
49394a7708 Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
2018-10-17 10:31:38 -07:00
Matthew Ahrens
5fbf85c4e2 Linux does not HAVE_DNLC
Since Linux does not have the Directory Name Lookup Cache, we don't need
the code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8031
2018-10-17 10:30:08 -07:00
Paul Dagnelie
d52d80b700 Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.

This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.

This functionality will be used by the Redacted Send/Receive feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
2018-10-16 11:15:04 -07:00
ilbsmart
779a6c0bf6 deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
   `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
   and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
   in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
   thread #2, so it stuck and can't complete,  then txg "n" will
   not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
2018-10-16 11:11:24 -07:00
Matthew Ahrens
0aa5916a30
OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects (#7979)
OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects

We're leaking the dd_clones objects in dsl_dir_destroy_sync.  This bug
appears to have been around forever.  Thankfully the amount of space
typically involved is tiny.

In addition this adds a mechanism in ZDB to find objects in the MOS
which are leaked (not referenced anywhere).

Porting notes:
* Added dd_crypto_obj to ZDB MOS object leak tracking

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://illumos.org/issues/9847
Closes #7979
2018-10-12 11:28:26 -07:00
Brian Behlendorf
27f80e85c2 Improved error handling for extreme rewinds
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing.  While this is the most common cause, it's not the only
cause.  Attemping to access a damaged ZAP will result in other
errors.

The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import.  Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully.  Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
2018-10-12 11:24:04 -07:00
Brian Behlendorf
d6c745830f Revert "Allow ECKSUM in vdev_checkpoint_sm_object()"
This reverts commit e927fc8a52.

Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7921
2018-10-12 11:24:04 -07:00
Matt Ahrens
5d43cc9a59 OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t.  However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.

We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.

Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
  This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
  check to prevent needless warnings when integrating changes
  which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
  needed to defer the cv_destroy() which was called immediately
  after cv_broadcast().  With d2733258 this should be safe but
  if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
  to this change in order to make it easier to apply OpenZFS 9689.
  - d88895a0, which removed the dummy znode from zvol_state
  - e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function.  Preserved the
  rangelock free list.  The cv_destroy() function will block waiting
  for all processes in cv_wait() to be scheduled and drop their
  reference.  This is done to ensure it's safe to free the condition
  variable.  However, blocking while holding the rl->rl_lock mutex
  can result in a deadlock on Linux.  A free list is introduced to
  defer the cv_destroy() and kmem_free() until after the mutex is
  released.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
2018-10-11 10:19:33 -07:00
Paul Dagnelie
0391690583 Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
2018-10-09 14:05:13 -07:00
Brian Behlendorf
5e8ff25644 Fix arc_release() refcount
Update arc_release to use arc_buf_size().  This hunk was accidentally
dropped when porting compressed send/recv, 2aa34383b.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8000
2018-10-09 10:10:26 -07:00
Brian Behlendorf
d7e4b30a67 Add zfs_refcount_transfer_ownership_many()
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred.  Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;

This issue may occur when using encryption with refcount debugging
enabled.  An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder.  When unsharing the buffer the
p_abd should be transferred.

This issue does not impact production builds because refcount holders
are not tracked.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
2018-10-09 10:05:48 -07:00
Matthew Ahrens
4cbde2ecbf Create /proc/sys/kernel/spl/gitrev with git hash
The existing mechanisms for determining what code is running in the
kernel do not always correctly report the git hash.  The versions
reported there do not reflect changes made since `configure` was run
(i.e. incremental builds do not update the version) and they are
misleading if git tags are not set up properly.  This applies to
`modinfo zfs`, `dmesg`, and `/sys/module/zfs/version`.

There are complicated requirements on how the existing version is
generated.  Therefore we are leaving that alone, and adding a new
mechanism to record and retrieve the git hash:
`cat /proc/sys/kernel/spl/gitrev`

The gitrev is re-generated at compile time, when running `make`
(including for incremental builds).  The value is the output of `git
describe` (or "unknown" if not in a git repo or there are uncommitted
changes).

We're also removing /proc/sys/kernel/spl/version, which was never very
useful.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7931 
Closes #7965
2018-10-08 21:57:02 -07:00
Matthew Ahrens
dfbe267503 OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
  changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
2018-10-04 13:13:28 -07:00
Paul Dagnelie
95542372e6 Add new fnvlist_lookup_* functions
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7977
2018-10-03 15:30:55 -07:00
Brad Lewis
c955398b52 OpenZFS 9677 - panic from zio_write_gang_block()
Panic from zio_write_gang_block() when creating dump device
on fragmented rpool.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7341a7d
Closes #7975
2018-10-03 09:50:06 -07:00
Tom Caputi
52ce99dd61 Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.

In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
2018-10-03 09:47:11 -07:00
Jerry Jelinek
f65fbee1e7 OpenZFS 9700 - ZFS resilvered mirror does not balance reads
Authored by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/82f63c3c
Closes #7973
2018-10-02 16:18:24 -07:00
Tim Schumacher
424fd7c3e0 Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.

To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
2018-10-01 10:42:05 -07:00
Brian Behlendorf
1258bd778e
Refine split block reconstruction
Due to a flaw in 4589f3ae the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900 
Closes #7934
2018-10-01 10:36:34 -07:00
John Gallagher
d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Tim Schumacher
c13060e478 Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.

Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.

Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885 
Closes #7932
2018-09-26 10:29:26 -07:00
Brian Behlendorf
7a23c81342
Fix small sysfs leak
When zfs_kobj_init() is called with an attr_cnt of 0 only the
kobj->zko_default_attrs is allocated.  It subsequently won't
get freed in zfs_kobj_release since the free is wrapped in
a kobj->zko_attr_count != 0 conditional.

Split the block in zfs_kobj_release() to make sure the
kobj->zko_default_attrs are freed in this case.

Additionally, fix a minor spelling mistake and typo in
zfs_kobj_init() which could also cause a leak but in practice
is almost certain not to fail.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7957
2018-09-26 09:50:58 -07:00
Brian Behlendorf
e897a23eb1
Fix statfs(2) for 32-bit user space
When handling a 32-bit statfs() system call the returned fields,
although 64-bit in the kernel, must be limited to 32-bits or an
EOVERFLOW error will be returned.

This is less of an issue for block counts since the default
reported block size in 128KiB. But since it is possible to
set a smaller block size, these values will be scaled as
needed to fit in a 32-bit unsigned long.

Unlike most other filesystems the total possible file counts
are more likely to overflow because they are calculated based
on the available free space in the pool. In order to prevent
this the reported value must be capped at 2^32-1. This is
only for statfs(2) reporting, there are no changes to the
internal ZFS limits.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7927 
Closes #7122 
Closes #7937
2018-09-24 17:11:25 -07:00
LOLi
dda5500853 vdev_disk_error() prints ASCII SOH to debug log
Currently vdev_disk_error() prepends its messages sent to the internal
ZFS debug log with KERN_WARNING, which is currently defined as follows:

   #define KERN_SOH      "\001"
   #define KERN_WARNING  KERN_SOH "4"

Since "\001" (ASCII Start Of Header) is not printable this results in
weird characters displayed when inspecting the debug log. This commit
simply removes this superfluous prefix passed to zfs_dbgmsg().

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7936
2018-09-21 09:42:42 -07:00
LOLi
7522a26077 Add limits to spa_slop_shift tunable
This change adds limits to the possible spa_slop_shift values set via
the sysfs interface. Accepted values are from a minimum of 1 to a
maximum of 31 (inclusive): these limits are based on the following
values observed on a 128PB file-vdev test pool:

spa_slop_shift=1, spa_get_slop_space=63.5PiB
spa_slop_shift=2, spa_get_slop_space=31.8PiB
spa_slop_shift=3, spa_get_slop_space=15.9PiB
spa_slop_shift=4, spa_get_slop_space=7.9PiB
spa_slop_shift=5, spa_get_slop_space=4PiB
spa_slop_shift=6, spa_get_slop_space=2PiB
...
spa_slop_shift=25, spa_get_slop_space=4GiB
spa_slop_shift=26, spa_get_slop_space=2GiB
spa_slop_shift=27, spa_get_slop_space=1016MiB
spa_slop_shift=28, spa_get_slop_space=508MiB
spa_slop_shift=29, spa_get_slop_space=254MiB
spa_slop_shift=30, spa_get_slop_space=128MiB
spa_slop_shift=31, spa_get_slop_space=128MiB
spa_slop_shift=32, spa_get_slop_space=128MiB

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7876 
Closes #7900
2018-09-20 21:10:12 -07:00
Roman Strashkin
733b5722b4 zpool split can create a corrupted pool
Added vdev_resilver_needed() check to verify VDEVs are fully
synced, so that after split the new pool will not be corrupted.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes #7865
Closes #7881
2018-09-12 18:14:42 -07:00
Don Brady
73a5ec30bf Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module).  When not built as a module add
the sysfs entries under /sys/fs/zfs/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868 
Closes #7872
2018-09-06 21:44:52 -07:00
Don Brady
e7b677aa5d Fix zfs_sysfs_live test failure
The ZTS zfs_sysfs_live test fails occasionally due to an uninitialized
string on an error path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7869
2018-09-06 17:36:00 -07:00
Don Brady
cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Chris Siebenmann
cfa37548eb Correctly handle errors from kern_path
As a regular kernel function, kern_path() returns errors as negative
errnos, such as -ELOOP. zfsctl_snapdir_vget() must convert these into
the positive errnos used throughout the ZFS code when it returns them
to other ZFS functions so that the ZFS code properly sees them as
errors.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Closes #7764
Closes #7864
2018-09-04 22:26:56 -07:00
Rich Ercolani
0405eeea6a Added recalculation of ARC stats mid-eviction
Re-adds a recalculation step for the ARC stats after the MRU
eviction so that we don't pathologically attempt to evict the MFU.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Mark Johnston <markj@freebsd.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #7855
2018-09-04 22:15:14 -07:00
Brian Behlendorf
27ca030fa6 Revert "Update zfs_admin_snapshot default value (disabled)"
This reverts commit a6214a0ae9.
Disabling zfs_admin_snapshot by default results in multiple ZTS
tests failing which depend on this functionality.  Revert this
change until the relevant test cases can be updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7838
2018-09-04 22:00:21 -07:00
George Melikov
a6214a0ae9 Update zfs_admin_snapshot default value (disabled)
It's disabled by default, update code to reflect
the documentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #7838
2018-09-04 17:21:24 -07:00
mav
c197a77c3c OpenZFS 9751 - Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks.  Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment.  Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by:	iXsystems, Inc.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9751
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/8253837ac3
Closes #7857
2018-09-02 12:22:45 -07:00
mav
e38afd34c3 OpenZFS 9738 - Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9738
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/63e7138
Closes #7858
2018-09-02 12:21:54 -07:00
Don Brady
b83a0e2dc1 Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments 
for the ioc commands that rely on nvpair input arguments (i.e. non 
legacy commands from libzfs_core). Automatically verify the expected 
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on 
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE, 
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
2018-09-02 12:14:01 -07:00
Don Brady
e8bcb693d6 Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature 
and property attributes. The primary consumer of this information 
is user processes, like the zfs CLI, that need to know what the 
current loaded ZFS module supports. The libzfs binary will consult 
this information when instantiating the zfs and zpool property 
tables and the pool features table.

This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
  features.runtime
  features.pool
  properties.dataset
  properties.pool

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
2018-09-02 12:09:53 -07:00
Brian Behlendorf
e927fc8a52
Allow ECKSUM in vdev_checkpoint_sm_object()
The checkpoint space map object may not be accessible from the
vdev's ZAP when it has been damaged.  This may be the case when
performing an extreme rewind when importing the pool.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7853
2018-08-31 14:20:34 -07:00
Matthew Ahrens
adb726eb0e clean up __dbuf_hold_impl
We can simplify the dbuf_hold code by allocating dbuf_hold_arg_t's on
demand, rather than allocating a big array of them up front.  While this
can occasionally increase the number of allocations, typically only one
allocation is needed since the indirect block is already cached.

The performance test suite gets the same results with this change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7841
2018-08-31 10:16:54 -07:00
Tom Caputi
c3bd3fb4ac OpenZFS 9403 - assertion failed in arc_buf_destroy()
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.

Porting notes:
* The ability to zinject decompression errors has been added, but
  this only works at the zio_decompress() level, where we have all
  of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
  functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
  ztest for further test coverage. Although this did uncover a few
  low priority issues, this unfortuantely also causes ztest to
  ASSERT in many locations where the code is working correctly since
  it is designed to fail on IO errors. Developers can manually set
  this variable with the '-o' option to find and debug issues.

Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Tom Caputi <tcaputi@datto.com>

OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
2018-08-29 11:33:33 -07:00
Tom Caputi
47ab01a18f Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-27 10:16:28 -07:00
Tom Caputi
8c4fb36a24 Small rework of txg_list code
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7795
2018-08-27 10:16:01 -07:00
Brian Behlendorf
a584ef2605
Direct IO support
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g.  readahead) while preventing contention for system
memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

    1.  Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling.  Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is in consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag.  Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

    2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

    3. O_DIRECT _MAY_ perform unbuffered IO operations directly
       between user memory and block device.

    No unbuffered IO operations are currently supported.  In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

    4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS.  Callers must provide
    O_DSYNC to request synchronous semantics.

    5. O_DIRECT _MAY_ disable file locking that serializes IO
       operations.  Applications should avoid mixing O_DIRECT
       and normal IO or mmap(2) IO to the same file.  This is
       particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT.  However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations.  Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.

References:
  * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
  * https://blogs.oracle.com/roch/entry/zfs_and_directio
  * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
  * https://illumos.org/man/3c/directio

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224 
Closes #7823
2018-08-27 10:04:21 -07:00
Joao Carlos Mendes Luis
5d6ad2442b Fedora 28: Fix misc bounds check compiler warnings
Fix a bunch of truncation compiler warnings that show up
on Fedora 28 (GCC 8.0.1).

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7368 
Closes #7826 
Closes #7830
2018-08-26 12:55:44 -07:00
LOLi
c434d8806c Stack overflow when destroying deeply nested clones
Destroy operations on deeply nested chains of clones can overflow
the stack:

        Depth    Size   Location    (221 entries)
        -----    ----   --------
  0)    15664      48   mutex_lock+0x5/0x30
  1)    15616       8   mutex_lock+0x5/0x30
...
 26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
 27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
 28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
...
185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
...
218)      304     128   kthread+0xdf/0x100
219)      176      48   ret_from_fork+0x22/0x40
220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from
recursive to iterative.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279 
Closes #7810
2018-08-22 11:03:31 -07:00
Tim Chase
2711b1d05f s/VERIFY/VERIFY3S in vdev_checkpoint_sm_object
Using VERIFY3S allows to view the unexpected error value in the system
log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #7809 
Closes #7818
2018-08-21 16:08:14 -07:00
Tom Caputi
149ce888bb Fix issues with raw receive_write_byref()
This patch fixes 2 issues with raw, deduplicated send streams. The
first is that datasets who had been completely received earlier in
the stream were not still marked as raw receives. This caused
problems when newly received datasets attempted to fetch raw data
from these datasets without this flag set.

The second problem was that the arc freeze checksum code was not
consistent about which locks needed to be held while performing
its asserts. The proper locking needed to run these asserts is
actually fairly nuanced, since the asserts touch the linked list
of buffers (requiring the header lock), the arc_state (requiring
the b_evict_lock), and the b_freeze_cksum (requiring the
b_freeze_lock). This seems like a large performance sacrifice and
a lot of unneeded complexity to verify that this relatively small
debug feature is working as intended, so this patch simply removes
these asserts instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7701
2018-08-20 11:03:56 -07:00
Serapheim Dimitropoulos
a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
Brian Behlendorf
4338c5c06f
Fix traverse_impl() kmem leak
The error path must free the memory allocated by this function or
it will be leaked.  In practice, this would leak only a few bytes
of memory under rare circumstances and thus is unlikely to have
caused any real problems.  This issue was caught by the kmemleak.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7791
2018-08-15 09:53:44 -07:00
Tom Caputi
1fff937a4c Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:49:19 -07:00
Tom Caputi
d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inheritted.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive heirarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Tomohiro Kusumi
fe8a7982ca Fix comment on calculating blkid
Fix comment on calculating blkid at level n within dnode's blkptrs.
"(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor
in this division.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7768
2018-08-13 13:33:47 -07:00
LOLi
c8c308362c Allow inherited properties in zfs_check_settable()
This change modifies how 'checksum' and 'dedup' properties are verified
in zfs_check_settable() handling the case where they are explicitly
inherited in the dataset hierarchy when receiving a recursive send
stream.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7755 
Closes #7576 
Closes #7757
2018-08-03 14:56:25 -07:00
Don Brady
fc1ecd16d7 zfs_ioc_unload_key can drop extra spa ref
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7759
2018-08-03 14:50:51 -07:00
Matthew Ahrens
62840030a7 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
2018-08-02 15:51:45 -07:00
John Gallagher
499b5497cb Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866 
Closes #7761
2018-08-02 14:03:56 -07:00
Nathan Lewis
010d12474c Add support for selecting encryption backend
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
  that control the crypto implementation.  At the moment there is a
  choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
  15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
  with as key schedules generated with various implementations
  are not necessarily interchangable.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes #7102 
Closes #7103
2018-08-02 11:59:24 -07:00
George Wilson
3d503a76e8 Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.

This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
2018-08-02 10:21:48 -07:00
Rohan Puri
fd7265c646 Fix deadlock between zfs umount & snapentry_expire
zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a
writer and calls zfsctl_snapshot_unmount_cancel(), which waits
for snapentry_expire() if present (when snap is automounted).
This snapentry_expire() itself then waits for zfs_snapshot_lock
as a reader, resulting in a deadlock.

The fix is to only hold the zfs_snapshot_lock over the tree
lookup and removal.  After a successful lookup the lock can
be dropped and zfs_snapentry_t will remain valid until the
reference taken by the lookup is released.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Closes #7751
Closes #7752
2018-08-01 15:00:02 -07:00
Paul Dagnelie
492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Don Brady
dae3e9ea21 OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
  zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
  were intentially not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
2018-07-30 11:30:41 -07:00
Serapheim Dimitropoulos
6b64382b17 OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation

While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.

An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists.  Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).

= Performance Analysis

To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        5.3
user        0.4
sys         2.3

[Trial 5]
real        8.2
user        0.4
sys         2.4

[Trial 6]
real        6.0
user        0.5
sys         2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        4.9
user        0.0
sys         0.9

[Trial 5]
real        3.8
user        0.0
sys         0.9

[Trial 6]
real        3.6
user        0.0
sys         0.9
```

== Analysis

The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.

From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts.  Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().

From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
  zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
2018-07-30 11:30:03 -07:00
Matthew Ahrens
1897bc0d48 OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block
Follow up commit for OpenZFS 9438.  See the OpenZFS-issue link below
for a complete analysis.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9439
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/779220d
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:28:09 -07:00
Paul Dagnelie
21d48b5eac OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times
As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is
yet another hole birth issue. In this one, if a block is entirely holes,
but the birth times are not all the same, we lose that information by
creating one hole with the current txg as its birth time.

The ZoL PR's fix approach is incorrect. Ultimately, the problem here is
that when you truncate and write a file in the same transaction group,
the dbuf for the indirect block will be zeroed out to deal with the
truncation, and then written for the write. During this process, we will
lose hole birth time information for any holes in the range. In the case
where a dnode is being freed, we need to determine whether the block
should be converted to a higher-level hole in the zio pipeline, and if
so do it when the dnode is being synced out.

Porting Notes:
* The DMU_OBJECT_END change in zfs_znode.c was already applied.
* Added test cases from #5675 provided by @rincebrain for hole_birth
  issues.  These test cases should be pushed upstream to OpenZFS.
* Updated mk_files which is used by several rsend tests so the
  files created are a little more interesting and may contain holes.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9438
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:27:49 -07:00
Brian Behlendorf
11d0525cbb
Add rwsem_tryupgrade for 4.9.20-rt16 kernel
The RT rwsem implementation was changed to allow multiple readers
as of the 4.9.20-rt16 patch set.  This results in a build failure
because the existing implementation was forced to directly access
the rwsem structure which has changed.

While this could be accommodated by adding additional compatibility
code.  This patch resolves the build issue by simply assuming the
rwsem can never be upgraded.  This functionality is a performance
optimization and all callers must already handle this case.

Converting the last remaining use of __SPIN_LOCK_UNLOCKED to
spin_lock_init() was additionally required to get a clean build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7589
2018-07-30 09:22:30 -07:00
Toomas Soome
5fadb7fb0c OpenZFS 8906 - uts: illumos rootfs should support salted cksum
Porting notes:
* As of grub-2.02 these checksums are not supported.  However, as
  pointed out in #6501 there are alternatives such as EFISTUB which
  work and have no such restriction.  A warning was added to the
  checksum property section of the zfs.8 man page.

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
2018-07-27 08:35:28 -07:00
Matthew Ahrens
3a549dc7a1 OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
  and dmu_object_alloc_dnsize() functions to use a common shared
  dmu_object_alloc_impl() function.

OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
Closes #7712
2018-07-25 14:11:35 -07:00
Feng Sun
750e1f88d3 Introduce kstat dmu_tx_dirty_frees_delay
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
2018-07-25 09:52:27 -07:00
Matthew Ahrens
802b1a7b3b OpenZFS 9338 - moved dnode has incorrect dn_next_type
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

While investigating a different problem, I noticed that moved dnodes
(those processed by dnode_move_impl() via kmem_move()) have an incorrect
dn_next_type. This could cause the on-disk dn_type to be changed to an
invalid value. The fix to copy the dn_next_type in dnode_move_impl().

Porting notes:
* For the moment this potential issue cannot occur on Linux since
  the SPL does not provide the kmem_move() functionality.

OpenZFS-issue: https://illumos.org/issues/9338
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13
Closes #7715
2018-07-24 17:10:42 -07:00
Tom Caputi
b7ddeaef3d Refactor arc_hdr_realloc_crypt()
The arc_hdr_realloc_crypt() function is responsible for converting
a "full" arc header to an extended "crypt" header and visa versa.
This code was originally written with a bcopy() so that any new
members added to arc headers would automatically be included
without requiring a code change. However, in practice this (along
with small differences in kmem_cache implementations between
various platforms) has caused a number of hard-to-find problems in
ports to other operating systems. This patch solves this problem
by making all member copies explicit and adding ASSERTs for fields
that cannot be set during the transfer. It also manually resets the
old header after the reallocation is finished so it can be properly
reallocated and reused.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7711
2018-07-24 12:20:04 -07:00
Steven Noonan
863522b1f9 dsl_scan_scrub_cb: don't double-account non-embedded blocks
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.

This was apparently a regression introduced in commit 00c405b4b5,
which was an incorrect port of this OpenZFS commit:

    https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720 
Closes #7738
2018-07-24 09:33:56 -07:00
Brian Behlendorf
d441e85dd7
Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.

Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on a vdev expansion will be
attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand,  the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of non-
  partitioned block device (loopback), a partition block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provided for better test coverage, and by removing
  the layering on ZFS volumes there issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  error are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This is as good as a unique physical path, while the
  volumes are not used in the test cases anymore for other reasons
  this improvement was included.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
2018-07-23 15:40:15 -07:00
Matthew Ahrens
2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:

The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf.  Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Don Brady
e4e94ca315 OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1
Closes #7700
2018-07-11 15:55:48 -07:00
Andriy Gapon
e902ddb0f8 OpenZFS 9479 - fix wrong format specifier for vdev_id
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9479
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/20aa447c
Closes #7699
2018-07-11 15:53:02 -07:00
Brian Behlendorf
ac09630d8b
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Brian Behlendorf
33a19e0fd9
Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time.  Otherwise
fallback to using the sa_copy_data() function.

The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space.  In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642 
Closes #7684
2018-07-11 13:10:40 -07:00
Matthew Ahrens
2dca37d8dc OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
Ztest failed with the following crash.

    ::status

    debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
    file: /usr/bin/amd64/ztest
    initial argv: /usr/bin/amd64/ztest
    threading model: raw lwps
    status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
    panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
    value type 'function' for key 'error')

    ::stack

    libc.so.1`_lwp_kill+0xa()
    libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0)
    libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0)
    libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00)
    0xfffffd7ffaa28146()
    0xfffffd7ffaa0a109()
    libzpool.so.1`luaD_throw+0x86(3011a48, 2)
    0xfffffd7ffa9350d3()
    0xfffffd7ffa93e3f1()
    libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3)
    libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0)
    libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450)
    libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30)
    libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750)
    libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0)
    ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1)
    ztest_dataset_destroy+0x53(1)
    ztest_run+0x59f(fffffd7fff0e0498)
    main+0x7ff(1, fffffd7fffdffa88)
    _start+0x6c()

The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return values, which either adds the wrong message to the
nvlist, or blows up.

The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9424
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429
Closes #7696
2018-07-10 21:29:23 -07:00
Matthew Ahrens
00c405b4b5 OpenZFS 9454 - ::zfs_blkstats should count embedded blocks
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.

Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
  as part of the sequential resilver functionality.  Since phys_birth
  would be zero that case should never happen as described above.  This
  is confirmed by the code coverage analysis.  Remove the conditional
  to realign that aspect of this function with OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9454
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7
Closes #7697
2018-07-10 10:41:38 -07:00
Prakash Surya
ab11916583 OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout
Problem
=======

Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail.  When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue".  As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.

Solution
========

As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9456
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4e
Closes #7695
2018-07-10 10:25:14 -07:00
Serapheim Dimitropoulos
a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Serapheim Dimitropoulos
4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Tom Caputi
370bbf66ae Fix coverity defects: CID 176037
CID 176037: Uninitialized scalar variable

This patch fixes an uninitialized variable defect caught by
coverity and introduced in 69830602

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7667
2018-07-02 13:37:48 -07:00
Tom Caputi
da2feb42fb Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.

Tested-by:  DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617 
Closes #7662
2018-06-28 14:55:11 -07:00
Chunwei Chen
edf60b8645 Enforce PROP_ONETIME on zpool properties
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #7661
2018-06-28 14:49:17 -07:00
Tom Caputi
69830602de Raw receive fix and encrypted objset security fix
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.

Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7632
2018-06-28 09:20:34 -07:00
Serapheim Dimitropoulos
d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defauls to 4KB. This is ok for metaslab
  space maps in terms of bandwirdth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_chedkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Serapheim Dimitropoulos
7637ef8d23 OpenZFS 9591 - ms_shift can be incorrectly changed
ms_shift can be incorrectly changed changed in MOS config for
indirect vdevs that have been historically expanded

According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.

As a result, if a vdev was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes #7644
2018-06-21 09:35:26 -07:00
Matthew Ahrens
af43029484 Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace.  This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.

We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.

Note that we already support multiple taskq's with the same name.  This
happens when there are multiple pools.  In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
2018-06-20 14:07:50 -07:00
Brian Behlendorf
1c38ac61e1
Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212.  This change
resulted in the default elevator being used which can negatively
impact performance.

Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator.  Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().

Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269.  Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
2018-06-19 21:52:45 -07:00
Brian Behlendorf
6413c95fbd
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
Tom Caputi
cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
John Gallagher
917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Brian Behlendorf
7b98f0d91f
Linux compat 4.18: check_disk_size_change()
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().

In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held.  Since the check events interface may poll the mutex
has been converted to a rwlock for better concurrently.  The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.

The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.

The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and assoicated dead code was removed.  This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.

Updated maximum Linux version in META file.  The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7611
2018-06-15 15:05:21 -07:00
Matthew Ahrens
1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily):

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Paul Zuchowski
2ffd89fcb9 Wrong error message when removing log device
In the case where the pool is loaded without the crypto
keys necessary to playback the intent log, and log device
removal is attempted, a generic busy message is received.
Change the message to inform the user that the datasets
must be mounted.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7518
2018-06-07 18:07:29 -07:00
Brian Behlendorf
174bcd581d
Fix preemptible warning in aggsum_add()
In the new aggsum counters the CPU_SEQID macro should be surrounded by
kpreempt_disable)() and kpreempt_enable() calls to prevent a Linux
kernel BUG warning.  The addsum_add() function use the cpuid to
minimize lock contention when selecting a bucket, after selection
the bucket is protected by a mutex and it is safe to reschedule the
process to a different processor at any time.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7609
Closes #7610
2018-06-07 15:55:11 -07:00
Nathaniel Clark
fba33c3819 Don't panic on bad SA_MAGIC in sa_build_index
If sa_build_index() encounters a corrupt buffer, don't panic.
Add info to zfs ring buffer and return EIO.  This allows for a cleaner
error recovery path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Issue #6500 
Closes #7487
2018-06-07 09:51:56 -07:00
Tom Caputi
b405837a6c Update the correct abd in l2arc_read_done()
This patch fixes an issue where l2arc_read_done() would always
write data to b_pabd, even if raw encrypted data was requested.
This only occured in cases where the L2ARC device had a different
ashift than the main pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7586 
Closes #7593
2018-06-06 10:17:50 -07:00
Tom Caputi
e7504d7a18 Raw receive functions must not decrypt data
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7564 
Closes #7584 
Closes #7592
2018-06-06 10:16:41 -07:00
Paul Dagnelie
37fb3e4318 OpenZFS 8484 - Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752
Closes #7462
2018-06-06 09:35:59 -07:00
Tony Hutter
f0ed6c7448 Add pool state /proc entry, "SUSPENDED" pools
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7331 
Closes #7563
2018-06-06 09:33:54 -07:00
Serapheim Dimitropoulos
e48afbc4eb OpenZFS 9464 - txg_kick() fails to see that we are quiescing
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without leaving them accumulate changes

Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:

    0ms   311MB in  4114ms (95% p1)  75MB/s  544MB (76%)  336us   153ms     0ms
    0ms     8MB in    51ms ( 0% p1) 163MB/s  474MB (66%)  129us    34ms     0ms
    0ms   366MB in  4454ms (93% p1)  82MB/s  572MB (79%)  498us    20ms     0ms
    0ms   406MB in  5212ms (95% p1)  77MB/s  591MB (82%)  661us    37ms     0ms
    0ms   340MB in  5110ms (94% p1)  66MB/s  622MB (86%) 1048us    41ms     1ms
    0ms     3MB in    61ms ( 0% p1)  51MB/s  419MB (58%)   33us     0ms     0ms
    0ms   361MB in  3555ms (88% p1) 101MB/s  542MB (75%)  335us    40ms     0ms
    0ms   356MB in  4592ms (92% p1)  77MB/s  561MB (78%)  430us    89ms     1ms
    0ms    11MB in   129ms (13% p1)  90MB/s  507MB (70%)  222us    15ms     0ms
    0ms   281MB in  2520ms (89% p1) 111MB/s  542MB (75%)  334us    42ms     0ms
    0ms   383MB in  3666ms (91% p1) 104MB/s  557MB (77%)  411us   133ms     0ms
    0ms   404MB in  5757ms (94% p1)  70MB/s  635MB (88%) 1274us   123ms     2ms
    4ms   367MB in  4172ms (89% p1)  88MB/s  556MB (77%)  401us    51ms     0ms
    0ms    42MB in   470ms (44% p1)  90MB/s  557MB (77%)  412us    43ms     0ms
    0ms   261MB in  2273ms (88% p1) 114MB/s  556MB (77%)  407us    27ms     0ms
    0ms   394MB in  3646ms (85% p1) 108MB/s  552MB (77%)  393us   304ms     0ms
    0ms   275MB in  2416ms (89% p1) 113MB/s  510MB (71%)  200us    53ms     0ms
    0ms     9MB in    53ms ( 0% p1) 169MB/s  483MB (67%)  140us   100ms     1ms

The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:

        if (tx->tx_syncing_txg == 0 &&
        tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
        tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
        tx->tx_quiesced_txg <= tx->tx_synced_txg) {
        tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
        cv_broadcast(&tx->tx_quiesce_more_cv);
    }

The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, ..etc) so the
following scenario can happen:

1] We have an open txg A that had enough dirty data (more than
   zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
   a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
   (and because of A's dirty data) and saw that we are not currently syncing
   any txg and no one has requested quiescing so it requests one by bumping
   tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
   request has been sent to it so it immediately grabs B without letting it
   gather enough data, putting it in a quiesced state and opening a new txg C.

In this scenario txg B, is an example of how the entries of interest show up in
the txg.d output.

Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b
Closes #7587
2018-06-04 14:56:06 -07:00
Pavel Zakharov
8a393be353 OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44
Closes #7532
2018-06-04 14:54:20 -07:00
Matthew Ahrens
1a5b96b8ee OpenZFS 9329 - panic in zap_leaf_lookup() due to concurrent zapification
For the null pointer issue shown below, the solution is to initialize the
contents of the object before changing its type, so that concurrent accessors
will see it as non-zapified until it is ready for access via the ZAP.

    BAD TRAP: type=e (#pf Page fault) rp=ffffff00ff520440 addr=20 occurred
    in module "zfs" due to a NULL pointer dereference

    ffffff00ff520320 unix:die+df ()
    ffffff00ff520430 unix:trap+dc0 ()
    ffffff00ff520440 unix:cmntrap+e6 ()
    ffffff00ff520590 zfs:zap_leaf_lookup+46 ()
    ffffff00ff520640 zfs:fzap_lookup+a9 ()
    ffffff00ff5206e0 zfs:zap_lookup_norm+111 ()
    ffffff00ff520730 zfs:zap_contains+42 ()
    ffffff00ff520760 zfs:dsl_dataset_has_resume_receive_state+47 ()
    ffffff00ff520900 zfs:get_receive_resume_stats+3e ()
    ffffff00ff520a90 zfs:dsl_dataset_stats+262 ()
    ffffff00ff520ac0 zfs:dmu_objset_stats+2b ()
    ffffff00ff520b10 zfs:zfs_ioc_objset_stats_impl+64 ()
    ffffff00ff520b60 zfs:zfs_ioc_objset_stats+33 ()
    ffffff00ff520bd0 zfs:zfs_ioc_dataset_list_next+140 ()
    ffffff00ff520c80 zfs:zfsdev_ioctl+4d7 ()
    ffffff00ff520cc0 genunix:cdev_ioctl+39 ()
    ffffff00ff520d10 specfs:spec_ioctl+60 ()
    ffffff00ff520da0 genunix:fop_ioctl+55 ()
    ffffff00ff520ec0 genunix:ioctl+9b ()
    ffffff00ff520f10 unix:brand_sys_sysenter+1c9 ()

Porting Notes:
* DMU_OT_BYTESWAP conditional in zap_lockdir_impl() kept.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9329
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e8e0f97
Closes #7578
2018-05-31 10:53:49 -07:00
Matthew Ahrens
d2a12f9e2a OpenZFS 9328 - zap code can take advantage of c99
The ZAP code was written before we allowed c99 in the Solaris kernel. We
should change it to take advantage of being able to declare variables where
they are first used. This reduces variable scope and means less scrolling
to find the type of variables.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9328
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/76ead05
Closes #7578
2018-05-31 10:53:11 -07:00
Sara Hartse
74d42600d8 zpool reopen should detect expanded devices
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes #7546 
Issue #7582
2018-05-31 10:36:37 -07:00
Tony Hutter
c26cf0966d Fix zio->io_priority failed (7 < 6) assert
This fixes an assert in vdev_queue_change_io_priority():

  VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
  PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7566 
Closes #7542
2018-05-29 18:13:48 -07:00
Brian Behlendorf
93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Brian Behlendorf
1272941f49 Merge branch 'zfsonlinux/merge-spl'
Merge a minimal version of the zfsonlinux/spl repository in to the
zfsonlinux/zfs repository.  Care was taken to prevent file conflicts
when merging and to preserve the spl repository history.  The spl
kernel module remains under the GPLv2 license as documented by the
additional THIRDPARTYLICENSE.gplv2 file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:57:55 -07:00
Brian Behlendorf
a91258913f Prepare SPL repo to merge with ZFS repo
This commit removes everything from the repository except the core
SPL implementation for Linux.  Those files which remain have been
moved to non-conflicting locations to facilitate the merge.
The README.md and associated files have been updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:51:39 -07:00
Matthew Ahrens
0dc2f70c5c OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

* Add the following as module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes

* Document the following module parameters:
   * zfs_condense_indirect_vdevs_enable
   * zfs_condense_max_obsolete_bytes
   * zfs_condense_min_mapping_bytes

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
2018-05-24 10:18:07 -07:00
Pavel Zakharov
38a19edd34 OpenZFS 9189 - Add debug to vdev_label_read_config when txg check fails
These changes were added to help debug issue #9187.

Essentially, in the original bug, vdev_validate() seems to fails in
vdev_label_read_config() and prints "failed reading config". This could
happen because either:
1. The labels are actually corrupt and zio_wait() fails for all of them
2. The labels were discarded because they didn't pass the txg check.

Beyond 9187, having debug info when case 2 happens could be useful in
other scenarios, such as zpool import.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by:  Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9189
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f6af1b7
Closes #7533
2018-05-14 14:32:49 -04:00
Pavel Zakharov
db7d07e14b OpenZFS 9191 - dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices
Add vdev_print_tree() in spa_check_for_missing_logs() when some log
devices are missing to ease debugging

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9191
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5c02e5
Closes #7531
2018-05-14 14:30:52 -04:00
Pavel Zakharov
a11c7aaec9 OpenZFS 9187 - racing condition between vdev label and spa_last_synced_txg in vdev_validate
ztest failed with uncorrectable IO error despite having the fix for
7163.  Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also
distinguishes it from that issue.

Definitely seems like a racing condition between the vdev_validate
and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to
   spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9187
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/805fda72
Closes #7529
2018-05-14 14:28:09 -04:00
Brian Behlendorf
b669ab83bb
Ignore *.o.ur-safe build artifacts
Generated when building on Ubuntu 18.04.  Also ignore the new
dynamically generated zfs-mount-generator.8 man page, and the
module/.cache.mk file.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7534
2018-05-13 18:59:02 -07:00
Olaf Faaland
bc5f51c5de module param callbacks check for initialized spa
Callbacks provided for module parameters are executed both
after the module is loaded, when a user alters it via sysfs, e.g
	echo bar > /sys/modules/zfs/parameters/foo

as well as when the module is loaded with an argument, e.g.
	modprobe zfs foo=bar

In the latter case, the init functions likely have not run yet,
including spa_init() which initializes the namespace lock so it is safe
to use.

Instead of immediately taking the namespace lock and attemping to
iterate over initialized spa structures, check whether spa_mode_global
is nonzero.  This is set by spa_init() after it has initialized the
namespace lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7496 
Closes #7521
2018-05-11 12:46:07 -07:00
Tim Chase
d1043e2f6d Unify behavior of deadman parameters
The zfs_deadman_failmode, zfs_deadman_ziotime_ms and
zfs_deadman_synctime_ms paramaters are stored per-pool.  However,
only the zfs_deadman_failmode updates the per-pool state when it's
change.  This patch gives adds the same behavior to the other two
for consistency.

Also, in all 3 three cases, only update the per-pool parameters
if spa_init() has actually been called in order to avoid panicking
when trying to take a lock on the spa_namespace_lock mutex.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7499
2018-05-08 21:45:47 -07:00
Tim Chase
a0ad7ca54e Clear vdev_faulted
Clear vdev_faulted if ZPOOL_CONFIG_AUX_STATE is not set to "external"

ZoL supports "zpool export -f" (force fault), which can be combined
with "-t" (temporary fault; don't persist across export/import) and
causes a MOS configuration to be set with ZPOOL_CONFIG_FAULTED=1
and without ZFS_CONFIG_AUX_STATE set at all.  In this case, the
previously-offlined vdev should be imported in an on-line state and.
Clearing the "vdev_faulted" flag causes the import to treat the
device as on-line.  Typically, resilver will catch it up based on
its DTL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7459
2018-05-08 21:39:50 -07:00
Pavel Zakharov
6cb8e5306d OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the paramters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation take lonnger in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2018-05-08 21:35:27 -07:00
Pavel Zakharov
afd2f7b711 OpenZFS 8962 - zdb should work on non-idle pools
Currently `zdb` consistently fails to examine non-idle pools as it
fails during the `spa_load()` process. The main problem seems to be
that `spa_load_verify()` fails as can be seen below:

    $ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

    ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from
`zdb`. This is done by creating a global flag in zfs and then setting
it in `zdb`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792
Closes #7459
2018-05-08 21:32:57 -07:00
Pavel Zakharov
4a0ee12af8 OpenZFS 8961 - SPA load/import should tell us why it failed
Problem
=======

When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".

While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support.  For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.

Solution
========

Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0
Closes #7459
2018-05-08 21:30:10 -07:00
Paul Dagnelie
ca0845d59e OpenZFS 9256 - zfs send space estimation off by > 10% on some datasets
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Added tuning to man page.
* Test case changes dropped, default behavior unchanged.

OpenZFS-issue: https://www.illumos.org/issues/9256
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56
Closes #7470
2018-05-08 08:59:24 -07:00
LOLi
4ceb8dd6fd Fix 'zpool create -t <tempname>'
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.

Fix this by specifying the correct pool name when setting the dataset
properties.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7502 
Closes #7509
2018-05-07 21:11:58 -07:00
Brian Behlendorf
c02c1becce
ZTS: Re-enable MMP tests
Commit 7fab6361 inadvertently disabled the MMP test cases by creating
and not removing an /etc/hostid file in the new zpool_split_props test
case.  When the file exists the ZTS skips the entire MMP test group
rather than modify what may be a system which is already configured.
Update the test case to remove the file.

Additionally, because the MMP tests were disabled a regression slipped
in as part of commit 9eb7b46ed0.  Fix it.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7514
2018-05-07 21:08:33 -07:00
Paul Dagnelie
64c1dcefe3 OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
     the deleted queue

It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.

We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.

Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.

We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:

1. When putting an object on the delete queue, change it's parent object
   number to a known constant that means NULL.

2. When displaying objects, first check if it is present on the delete
   queue.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2018-05-04 10:50:24 -07:00
Matthew Ahrens
5e097c67f1 OpenZFS 9443 - panic when scrub a v10 pool
While expanding stored pools, we ran into a panic using an old pool.

Steps to reproduce:

    $ sudo zpool create -o version=2 test c2t1d0
    $ sudo cp /etc/passwd /test/foo
    $ sudo zpool attach test c2t1d0 c2t2d0

We'll get this panic:

    ffffff000fc0e5e0 unix:real_mode_stop_cpu_stage2_end+b27c ()
    ffffff000fc0e6f0 unix:trap+dc8 ()
    ffffff000fc0e700 unix:cmntrap+e6 ()
    ffffff000fc0e860 zfs:dsl_scan_visitds+1ff ()
    ffffff000fc0ea20 zfs:dsl_scan_visit+fe ()
    ffffff000fc0ea80 zfs:dsl_scan_sync+1b3 ()
    ffffff000fc0eb60 zfs:spa_sync+435 ()
    ffffff000fc0ec20 zfs:txg_sync_thread+23f ()
    ffffff000fc0ec30 unix:thread_start+8 ()

The problem is a bad trap accessing a NULL pointer. We're looking for
the dp_origin_snap of a dsl_pool_t, but version 2 didn't have that. The
system will go into a reboot loop at this point, and the dump won't be
accessible except by removing the cache file from within the recovery
environment.

This impacts any sort of scrub or resilver on version <11 pools, e.g.:

    $ zpool create -o version=10 test c2t1d0
    $ zpool scrub test

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9443
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/010eed29
Closes #7501
2018-05-04 10:47:10 -07:00
Tom Caputi
be9a5c355c Add support for decryption faults in zinject
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.

This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7474
2018-05-02 15:36:20 -07:00
Brian Behlendorf
9464b9591e
RHEL 7.5 compat: FMODE_KABI_ITERATE
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.

Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:

* Requires that callers intending to use this extended interface
  set the FMODE_KABI_ITERATE flag on the file structure when
  opening the directory.
* Adds the fops.iterate() method to the end of the structure,
  without removing fops.readdir().

This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected.  Instead fallback to
the fops.readdir() interface which will be available.

Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7460 
Closes #7463
2018-05-02 15:01:24 -07:00
Brian Behlendorf
bc8a6a60e9
Fix inst_num overflow in qat_crypt.c
This patch fixes the same issue which was previously addressed in
6051.  The variable "inst_num" was of the incorrect type and
"atomic_inc_32_nv()" could cause an overflow damaging its neighbor.

Cast the return value of atomic_inc_32_nv() to Cpa32U.

Fix a few types for num_inst for clarity.

Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7468
2018-05-01 20:44:24 -07:00
Tom Caputi
2c24b5b148 Fix issues found with zfs diff
Two deadlocks / ASSERT failures were introduced in a2c2ed1b which
would occur whenever arc_buf_fill() failed to decrypt a block of
data. This occurred because the call to arc_buf_destroy() which
was responsible for cleaning up the newly created buffer would
attempt to take out the hdr lock that it was already holding. This
was resolved by calling the underlying functions directly without
retaking the lock.

In addition, the dmu_diff() code did not properly ensure that keys
were loaded and mapped before begining dataset traversal. It turns
out that this code does not need to look at any encrypted values,
so the code was altered to perform raw IO only.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7354 
Closes #7456
2018-05-01 11:24:20 -07:00
Tomohiro Kusumi
d6133fc500 Silence compile-time warning on unused variable
ASSERT3U() could be NOP which then leads to having unused pointer *spa.

metaslab.c: In function 'metaslab_condense':
metaslab.c:2075:9: warning: unused variable 'spa' [-Wunused-variable]
  spa_t *spa = msp->ms_group->mg_vd->vdev_spa;

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7489
2018-05-01 11:15:54 -07:00
loli10K
85ce3f4fd1 Adopt pyzfs from ClusterHQ
This commit introduces several changes:

 * Update LICENSE and project information

 * Give a good PEP8 talk to existing Python source code

 * Add RPM/DEB packaging for pyzfs

 * Fix some outstanding issues with the existing pyzfs code caused by
   changes in the ABI since the last time the code was updated

 * Integrate pyzfs Python unittest with the ZFS Test Suite

 * Add missing libzfs_core functions: lzc_change_key,
   lzc_channel_program, lzc_channel_program_nosync, lzc_load_key,
   lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops,
   lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync,
   lzc_unload_key, lzc_remap

Note: this commit slightly changes zfs_ioc_unload_key() ABI. This allow
to differentiate the case where we tried to unload a key on a
non-existing dataset (ENOENT) from the situation where a dataset has
no key loaded: this is consistent with the "change" case where trying
to zfs_ioc_change_key() from a dataset with no key results in EACCES.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7230
2018-05-01 10:33:35 -07:00
Alexander Motin
20507534d4 OpenZFS 9434 - Speculative prefetch is blocked by device removal code
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal.  At least one visual consequence
of it is completely blocked speculative prefetcher.  This patch sets
the variable in such situations.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9434
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/16127b627b
Closes #7480
2018-04-30 13:05:55 -07:00
Matthew Ahrens
964c2d69a9 OpenZFS 9236 - nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5

OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668
Closes #7467
2018-04-30 10:19:48 -07:00
Mark Wright
089500e792 Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build
Fix build errors with gcc 7.3.0 on Gentoo with kernel 4.16.3
built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

module/zfs/vdev_indirect.c:296:2: error:
positional initialization of field in ‘struct’ declared with
‘designated_init’ attribute [-Werror=designated-init]
  vdev_indirect_map_free,
  ^~~~~~~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Mark Wright <gienah@gentoo.org>
Closes #7464
2018-04-20 09:53:25 -07:00
Chunwei Chen
599b864813 Fix ENOSPC in "Handle zap_add() failures in ..."
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7401 
Closes #7421
2018-04-18 14:19:50 -07:00
Tom Caputi
b0ee5946aa Fix issues with raw sends of spill blocks
This patch fixes 2 issues in how spill blocks are processed during
raw sends. The first problem is that compressed spill blocks were
using the logical length rather than the physical length to
determine how much data to dump into the send stream. The second
issue is a typo that caused the spill record's object number to be
used where the objset's ID number was required. Both issues have
been corrected, and the payload_size is now printed in zstreamdump
for future debugging.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7378 
Closes #7432
2018-04-17 11:19:03 -07:00
Tom Caputi
e14a32b1c8 Fix object reclaim when using large dnodes
Currently, when the receive_object() code wants to reclaim an
object, it always assumes that the dnode is the legacy 512 bytes,
even when the incoming bonus buffer exceeds this length. This
causes a buffer overflow if --enable-debug is not provided and
triggers an ASSERT if it is. This patch resolves this issue and
adds an ASSERT to ensure this can't happen again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7097
Closes #7433
2018-04-17 11:13:57 -07:00
Matthew Ahrens
0c03d21ac9 assertion in arc_release() during encrypted receive
In the existing code, when doing a raw (encrypted) zfs receive, 
we call arc_convert_to_raw() from open context. This creates a 
race condition between arc_release()/arc_change_state() and 
writing out the block from syncing context (arc_write_ready/done()).

This change makes it so that when we are doing a raw (encrypted) 
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode 
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() 
from syncing context when writing out the block of dnodes.

Additionally, we can eliminate dr_raw and associated setters, and 
instead know that dnode blocks are always raw when doing a zfs 
receive (see the new field os_raw_receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7424 
Closes #7429
2018-04-17 11:06:54 -07:00
Matthew Ahrens
7f96cc23ac OpenZFS 9192 - explicitly pass good_writes to vdev_uberblock/label_sync
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

OpenZFS-issue: https://www.illumos.org/issues/9192
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f4c0b602d
Closes #7446
2018-04-17 10:45:47 -07:00
Matt Ahrens
d830d4795a OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices
Authored by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c
Closes #7445
2018-04-17 10:44:50 -07:00
megari
d68ac65eb6 Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier"
This reverts commit cbb8933215.

The original change in OpenZFS 9036 did remove duplicate 'const'
specifiers, but the ZoL port had already done what *should* have been
done in OpenZFS 9036, which is to make the pointers themselves const.
The port of the change to ZoL ended up doing an unnecessary removal
of the constness of the pointers. Undo that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Closes #7444
2018-04-16 12:44:40 -07:00
Pavel Zakharov
9eb7b46ed0 OpenZFS 7638 - Refactor spa_load_impl into several functions
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7638
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6
Closes #7437
2018-04-16 12:24:23 -07:00
Tim Chase
5284f43a1e Avoid Linux hung task message in ZTHR
Use an interruptible to avoid Linux hung task message in
ZTHR and to prevent inflating the load average.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7440 
Closes #7441
2018-04-15 15:12:28 -07:00
Toomas Soome
5e567da987 OpenZFS 9213 - zfs: sytem typo
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The additional instances of this typo addressed in the OpenZFS
  patch were already resolved.

OpenZFS-issue: https://illumos.org/issues/9213
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92
Closes #7436
2018-04-15 10:59:13 -07:00
Toomas Soome
cbb8933215 OpenZFS 9036 - zfs: duplicate 'const' declaration specifier
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434
Closes #6900
2018-04-14 12:40:52 -07:00
Prakash Surya
eecdd8e884 OpenZFS 9084 - spa_*_ashift must ignore spare devices
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7
Closes #6900
2018-04-14 12:40:52 -07:00
Serapheim Dimitropoulos
4bf8108ede OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6
Closes #6900
2018-04-14 12:40:47 -07:00
Serapheim Dimitropoulos
9d5b524597 OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:

[1] Thread A is about to finish condesing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertions fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2018-04-14 12:23:53 -07:00
Brian Behlendorf
4589f3ae4c Optimize possible split block search space
Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900
2018-04-14 12:22:43 -07:00
Matthew Ahrens
9e052db462 OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-04-14 12:21:39 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Seth Forshee
93b43af10d Allow mounting datasets more than once
Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:

- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset

The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)

The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.

Change zpl_mount() to call sget() directly with it's own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes #5796
Closes #7207
2018-04-13 10:44:05 -07:00
beren12
7403d0743e Fix zfs_arc_max minimum tuning
When setting `zfs_arc_max` its minimum value is allowed
to be 64 MiB.  There was an off-by-1 error which can matter
on tiny systems.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes #7417
2018-04-12 10:47:32 -07:00
Tom Caputi
edc1e713c2 Fix race in dnode_check_slots_free()
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147 
Closes #7388
2018-04-10 11:15:05 -07:00
Giuseppe Di Natale
10f88c5cd5 Linux compat 4.16: blk_queue_flag_{set,clear}
queue_flag_{set,clear}_unlocked are now private interfaces in
the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14).
Use blk_queue_flag_{set,clear} interfaces which were introduced as
of https://github.com/torvalds/linux/commit/8814ce8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7410
2018-04-10 10:32:14 -07:00
Tony Hutter
4f301661df Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Cloes #7416
2018-04-09 14:24:46 -07:00
Brian Behlendorf
3b0d99289a
Fix 'zfs send/recv' hang with 16M blocks
When using 16MB blocks the send/recv queue's aren't quite big
enough.  This change leaves the default 16M queue size which a
good value for most pools.  But it additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7365 
Closes #7404
2018-04-08 19:41:15 -07:00
Matthew Ahrens
5c27ec1088 Fixes for SNPRINTF_BLKPTR with encrypted BP's
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390
2018-04-06 13:30:26 -07:00
Olaf Faaland
0ba106e75c Fix divide-by-zero in mmp_delay_update()
vdev_count_leaves() in the denominator may return 0, caught by Coverity.
Introduced by

* 533ea04 Update mmp_delay on sync or skipped, failed write

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7391
2018-04-06 13:29:11 -07:00
Olaf Faaland
533ea0415b Update mmp_delay on sync or skipped, failed write
When an MMP write is skipped, or fails, and time since
mts->mmp_last_write is already greater than mts->mmp_delay, increase
mts->mmp_delay.  The original code only updated mts->mmp_delay when a
write succeeded, but this results in the write(s) after delays and
failed write(s) reporting an ub_mmp_delay which is too low.

Update mmp_last_write and mmp_delay if a txg sync was successful.  At
least one uberblock was written, thus extending the time we can be sure
the pool will not be imported by another host.

Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) /
vdev_count_leaves()) so that a period of frequent successful MMP writes,
e.g. due to frequent txg syncs, does not result in an import activity
check so short it is not reliable based on mmp thread writes alone.

Remove unnecessary local variable, start.  We do not use the start time
of the loop iteration.

Add a debug message in spa_activity_check() to allow verification of the
import_delay value and to prove the activity check occurred.

Alter the tests that import pools and attempt to detect an activity
check.  Calculate the expected duration of spa_activity_check() based on
module parameters at the time the import is performed, rather than a
fixed time set in mmp.cfg.  The fixed time may be wrong.  Also, use the
default zfs_multihost_interval value so the activity check is longer and
easier to recognize.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7330
2018-04-04 16:38:44 -07:00
Tony Hutter
21a4f5cc86 Fedora 28: Fix misc bounds check compiler warnings
Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7361 
Closes #7368
2018-04-04 10:16:47 -07:00
LOLi
1724eb62de Fix spa reference leak in zfs_ioc_pool_scan
zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a
valid pool_scrub_cmd_t: this could happen if the userland binaries
and ZFS kernel module differ in version and would prevent the pool from
being exported.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7380
2018-04-03 17:31:30 -07:00
Tim Chase
10adee27ce Remove ASSERT() in l2arc_apply_transforms()
The ASSERT was erroneously copied from the next section of code.
The buffer's size should be expanded from "psize" to "asize"
if necessary.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7375
2018-03-31 15:14:21 -07:00
Tom Caputi
a2c2ed1bd4 Decryption error handling improvements
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372
2018-03-31 11:12:51 -07:00
Tom Caputi
4515b1d01c Encrypted dnode blocks should be prefetched raw
Encrypted dnode blocks are always initially read as raw data and
converted to decrypted data when an encrypted bonus buffer is
needed. This allows the DMU to be used for things like fetching
the DMU master node without requiring keys to be loaded. However,
dbuf_issue_final_prefetch() does not currently read the data as
raw. The end result of this is that prefetched dnode blocks are
read twice from disk: once decrypted and then again as raw data.
This patch corrects the issue by adding the flag when appropriate.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7362
2018-03-31 11:11:48 -07:00
LOLi
77d8a0f1a4 Fix hung z_zvol tasks during 'zfs receive'
During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue.  This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3>      Tainted: P           O  3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                       act  nthr  spwn  maxt   pri  mina
spl_system_taskq/0            1     2     0    64   100     1
	active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
	wait: 5541
spl_delay_taskq/0             0     1     0     4   100     1
	delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                      1     1     0     1   120     1
	active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
	pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330 
Closes #6890 
Closes #7343
2018-03-30 12:10:01 -07:00
Andriy Gapon
5e00213e43 OpenZFS 9164 - assert: newds == os->os_dsl_dataset
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
  to successfully run in under 5 minutes.

OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a
Closes #6112
Closes #7336
2018-03-30 12:00:40 -07:00
Tom Caputi
32dce2da0c Resolve QAT issues with incompressible data
Currently, when ZFS wants to accelerate compression with QAT, it
passes a destination buffer of the same size as the source buffer.
Unfortunately, if the data is incompressible, QAT can actually
"compress" the data to be larger than the source buffer. When this
happens, the QAT driver will return a FAILED error code and print
warnings to dmesg. This patch fixes these issues by providing the
QAT driver with an additional buffer to work with so that even
completely incompressible source data will not cause an overflow.

This patch also resolves an error handling issue where
incompressible data attempts compression twice: once by QAT and
once in software. To fix this issue, a new (and fake) error code
CPA_STATUS_INOMPRESSIBLE has been added so that the calling code
can correctly account for the difference between a hardware
failure and data that simply cannot be compressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7338
2018-03-29 17:40:34 -07:00
Tom Caputi
13a2ff2727 Fix ASSERT in dsl_scan_fini() and cleanup comments
This patch fixes an issue where dsl_scan_prefetch_cb() might
add more prefetch I/Os to the prefetch queue after prefetching
has been completed. This was happening because that code was
checking scn->scn_suspending instead of scn->scn_prefetch_stop.
This occasionally triggered an ASSERT during ztest runs in
dsl_scan_fini() when the code attempted to destroy an AVL tree
that still had entires in it. This patch also includes a number
of spelling corrections and comment cleanups throughout
dsl_scan.c

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7353
2018-03-28 18:30:44 -07:00
Brian Behlendorf
b2ab468dde
Fix mmap / libaio deadlock
Calling uiomove() in mappedread() under the page lock can result
in a deadlock if the user space page needs to be faulted in.

Resolve the issue by dropping the page lock before the uiomove().
The inode range lock protects against concurrent updates via
zfs_read() and zfs_write().

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7335 
Closes #7339
2018-03-28 10:19:22 -07:00
Allan Jude
5152a74088 OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9321
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18
Closes #7333
2018-03-26 20:40:15 -07:00
Tom Caputi
157ef7f6a5 Don't count embedded bps in read stats
Currently, ZFS tracks statistics about calls to arc_read()
via the /proc/spl/kstat/zfs/<pool>/reads file for debugging.
Unfortunately, this file currently counts embedded bps as
disk reads since they are technically processed by the ZIO
layer. This pollutes the log since the ARC will never cache
embedded bps. This patch  corrects this issue by preventing
the logging of embedded bp reads.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7334
2018-03-23 21:35:19 -07:00
Matthew Ahrens
2fd92c3d6c enable zfs_dbgmsg() by default, without dprintf()
zfs_dbgmsg() should record a message by default.  As a general
principal, these messages shouldn't be too verbose.  Furthermore, the
amount of memory used is limited to 4MB (by default).

dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags.  This flag is not set by default
(even on debug builds).  These messages are extremely verbose, and
sometimes nontrivial to compute.

SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags.  This flag is not set by default (even on debug builds).

This brings our behavior in line with illumos.  Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7314
2018-03-21 15:37:32 -07:00
Tom Caputi
8d9e7c8fbe Fix spelling errors in comments
This patch simply corrects some spelling / grammar errors in
the QAT and encryption code comments. No functional changes

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7319
2018-03-21 08:42:13 -07:00
Brian Behlendorf
a76f3d0437
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock.  This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed.  This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <--- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->zio_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7307
2018-03-16 16:46:06 -07:00
Tom Caputi
17dd88352e Allow QAT to handle non page-aligned buffers
This patch adds some handling to the QAT acceleration functions
that allows them to handle buffers that are not aligned with the
page cache. At the moment this never happens since callers only
happen to work with page-aligned buffers, but this code should
prevent headaches if this isn't always true in the future. This
patch also adds some cleanups to align the QAT compression code
with the encryption and checksumming code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7305
2018-03-16 16:02:49 -07:00
Olaf Faaland
cec3a0a1bb Report pool suspended due to MMP
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296
2018-03-15 10:56:55 -07:00
Tom Caputi
3874220932 SHA256 QAT acceleration
This patch enables acceleration of SHA256 checksums using Intel
Quick Assist Technology. This patch also fixes up and refactors
some of the code from QAT encryption to make the behavior
consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7295
2018-03-15 10:53:58 -07:00
Paul Zuchowski
1a2342784a receive_spill does not byte swap spill contents
In zfs receive, the function receive_spill should account
for spill block endian conversion as a defensive measure.
 
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7300
2018-03-15 10:29:51 -07:00
Brian Behlendorf
de4f8d5d26
OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression
With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress
indirect blocks, under a workload of random cached reads. To reduce this
decompression cost, we would like to increase the size of the dbuf cache so
that more indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the dbuf
cache as well.)

In real world workloads, this won't help as dramatically as the example above,
but we think it's still worth it because the risk of decreasing performance is
low. The potential negative performance impact is that we will be slightly
reducing the size of the ARC (by ~3%).

Porting Notes:
* Added modules options to zfs-module-parameters.5 man page.
* Preserved scaling based on target ARC size rather than max ARC size.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9188
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564
Upstream bug: DLPX-46942
Closes #7273
2018-03-13 10:52:48 -07:00
Brian Behlendorf
a6cc97566c
Add kernel module auto-loading
Historically a dynamic misc minor number was registered for the
/dev/zfs device in order to prevent minor number collisions.  This
was fine but it prevented us from being able to use the kernel
module auto-loaded which requires a known reserved value.

Resolve this issue by adding a configure test to find an available
misc minor number which can then be used in MODULE_ALIAS_MISCDEV at
build time.  By adding this alias the zfs kmod is added to the list
of known static-nodes and the systemd-tmpfiles-setup-dev service
will create a /dev/zfs character device at boot time.

This in turn allows us to update the 90-zfs.rules file to make it
aware this is a static node.  The upshot of this is that whenever
a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be
automatic loaded.  This even works for unprivileged users so there
is no longer a need to manually load the modules at boot time.

As an additional bonus the zed now no longer needs to start after
the zfs-import.service since it will trigger the module load.

In the unlikely event the minor number we selected conflicts with
another out of tree unregistered minor number the code falls back
to dynamically allocating it.  In this case the modules again
must be manually loaded.

Note that due to the change in the method of registering the minor
number the zimport.sh test case may incorrectly fail when the
static node for the installed packages is created instead of the
dynamic one.  This issue will only transiently impact zimport.sh
for this single commit when we transition and are mixing and
matching methods.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7287
2018-03-13 10:45:55 -07:00
Tim Chase
02638a30ef Add zfs_scan_ignore_errors tunable
When it's set, a DTL range will be cleared even if its scan/scrub had
errors.  This allows to work around resilver/scrub upon import when the
pool has errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7293
2018-03-13 10:43:14 -07:00
Chunwei Chen
9470cbd4f9 Fix race in trace point in zrl_add_impl
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
				        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291
2018-03-12 11:27:02 -07:00
Brian Behlendorf
b7eec00f9f
Fix MMP write frequency for large pools
When a single pool contains more vdevs than the CONFIG_HZ for
for the kernel the mmp thread will not delay properly.  Switch
to using cv_timedwait_sig_hires() to handle higher resolution
delays.

This issue was reported on Arch Linux where HZ defaults to only
100 and this could be fairly easily reproduced with a reasonably
large pool.  Most distribution kernels set CONFIG_HZ=250 or
CONFIG_HZ=1000 and thus are unlikely to be impacted.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7205 
Closes #7289
2018-03-12 11:26:05 -07:00
Olaf Faaland
743253df70 Hold SCL_VDEV when counting leaves
A config lock should be held while vdev_count_leaves() walks the tree,
otherwise the pointers reference may become invalid during the walk.

SCL_VDEV is a minimal lock provided for such uses cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:54 -08:00
Olaf Faaland
ebed90a598 Handle zio_resume and mmp => off
When multihost is disabled on a pool, and the pool is resumed via zpool
clear, within a single cycle of the mmp thread's loop (e.g.  while it's
in the cv_timedwait call), both mmp_last_write and mmp_delay should be
updated.

The original code mistakenly treated the two cases as if they could not
occur at the same time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:11 -08:00
Tom Caputi
cf63739191 QAT support for AES-GCM
This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7282
2018-03-09 13:37:15 -08:00
Wolfgang Bumiller
0e85048f53 Take user namespaces into account in policy checks
Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

This also adds an initial user namespace regression test
for the setgid bit loss, with a user_ns_exec helper usable
in further tests.

Additionally, configure checks for the required user
namespace related features are added for:
  * ns_capable
  * kuid/kgid_has_mapping()
  * user_ns in cred_t

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes #6800 
Closes #7270
2018-03-07 15:40:42 -08:00
Olaf Faaland
d2160d0538 Record skipped MMP writes in multihost_history
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:15:15 -08:00
Olaf Faaland
14c240cede Detect long config lock acquisition in mmp
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:14:39 -08:00
Nasf-Fan
2705ebf0a7 Misc fixes and cleanup for project quota
1) The Coverity Scan reports some issues for the project
   quota patch, including:

1.1) zfs_prop_get_userquota() directly uses the const quota
   type value as the condition check by wrong.

1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid
   to be overwritten by dnode::dn->dn_oldprojid.

2) This patch fixes related issues. It also enhances the logic
   for zfs_project_item_alloc() to avoid buffer overflow.

3) Skip project quota ability check if does not change project
   quota related things (id or flag). Otherwise, it will cause
   chattr (for other non project quota flags) operation failed
   if project quota disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Closes #7251 
Closes #7265
2018-03-05 12:56:27 -08:00
Giuseppe Di Natale
dd3e1e3083 Linux 4.16 compat: get_disk_and_module()
As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk()
is now get_disk_and_module(). Add a configure check to determine
if we need to use get_disk_and_module().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7264
2018-03-05 12:44:35 -08:00
Tony Hutter
80d52c3919 Change checksum & IO delay ratelimit values
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252
2018-03-04 17:34:51 -08:00
chrisrd
5666a994f2 Increment zil_itx_needcopy_bytes properly
In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw)
into the log write block (lwb) and send it off using zil_lwb_add_txg().
If we also have WR_NEED_COPY, we additionally copy the lwr's data into
the lwb to be sent off.  If the lwr + data doesn't fit into the lwb, we
send the lrw and as much data as will fit (dnow bytes), then go back
and do the same with the remaining data.

Each time through this loop we're sending dnow data bytes. I.e.
zil_itx_needcopy_bytes should be incremented by dnow.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6988 
Closes #7176
2018-03-02 10:01:53 -08:00
Tom Caputi
095495e008 Raw DRR_OBJECT records must write raw data
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215 
Closes #7236
2018-02-27 09:04:05 -08:00
Brian Behlendorf
3673d03285
Fix more cstyle warnings
This patch contains no functional changes.  It is solely intended
to resolve cstyle warnings in order to facilitate moving the spl
source code in to the zfs repository.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #687
2018-02-24 10:05:37 -08:00
chrisrd
e9a7729008 Fix free memory calculation on v3.14+
Provide infrastructure to auto-configure to enum and API changes in the
global page stats used for our free memory calculations.

arc_free_memory has been broken since an API change in Linux v3.14:

2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node
2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node
  vmstats

These commits moved some of global_page_state() into
global_node_page_state(). The API change was particularly egregious as,
instead of breaking the old code, it silently did the wrong thing and we
continued using global_page_state() where we should have been using
global_node_page_state(), thus indexing into the wrong array via
NR_SLAB_RECLAIMABLE et al.

There have been further API changes along the way:

2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to
  node counters
2017-09-06 v4.14 c41f012a mm: rename global_page_state to
  global_zone_page_state

...and various (incomplete, as it turns out) attempts to accomodate
these changes in ZoL:

2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats
2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc
2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc

The config infrastructure provided here resolves these issues going back
to the original API change in v3.14 and is robust against further Linux
changes in this area.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7170
2018-02-23 08:50:06 -08:00
Olaf Faaland
7088545d01 Report duration and error in mmp_history entries
After an MMP write completes, update the relevant mmp_history entry
with the time between submission and completion, and the error
status of the write.

[faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
10610    1166    1518985090 0      144826     633048078    4882...
10611    1166    1518985090 0      164527     666187671    1151...

Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and
error = zio->io_error.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7190
2018-02-22 15:34:34 -08:00
Olaf Faaland
0d398b2564 Do not initiate MMP writes while pool is suspended
While the pool is suspended on host A, it may be imported on host B.
If host A continued to write MMP blocks, it would be blindly
overwriting MMP blocks written by host B, and the blocks written by
host A would have outdated txg information.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7182
2018-02-22 09:14:46 -08:00
Tom Caputi
f8478fc2ca Fix bounds check in zio_crypt_do_objset_hmacs
The current bounds check in zio_crypt_do_objset_hmacs() does not
properly handle the possible sizes of the objset_phys_t and
can therefore read outside the buffer's memory. If that memory
happened to match what the check was actually looking for, the
objset would fail to be owned, complaining that the MAC was
invalid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7210
2018-02-22 08:50:14 -08:00
Tomohiro Kusumi
378c6ed549 Fix function name typos
vn_init() and vn_fini() had been renamed by 12ff95ff in 2011.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #686
2018-02-21 15:12:10 -08:00
Tomohiro Kusumi
68386b0503 Staticize kstat_default_update()
This is only used via ->ks_update of `kstat_t *`.
This isn't exported nor do headers have its prototype.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #686
2018-02-21 15:11:38 -08:00
Toomas Soome
09302a4ca8 OpenZFS 9035 - zfs: this statement may fall through
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9035
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/46ac8fdfc5
Closes #7206
2018-02-21 14:55:34 -08:00
Tom Caputi
b0918402dc Raw receive should change key atomically
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the later of
which is moved to dmu_recv_end() so that they key change can be
done atomically.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7200
2018-02-21 12:31:03 -08:00
Tom Caputi
4a385862b7 Prevent raw zfs recv -F if dataset is unencrypted
The current design of ZFS encryption only allows a dataset to
have one DSL Crypto Key at a time. As a result, it is important
that the zfs receive code ensures that only one key can be in use
at a time for a given DSL Directory. zfs receive -F complicates
this, since the new dataset is received as a clone of the existing
one so that an atomic switch can be done at the end. To prevent
confusion about which dataset is actually encrypted a check was
added to ensure that encrypted datasets cannot use zfs recv -F to
completely replace existing datasets. Unfortunately, the check did
not take into account unencrypted datasets being overriden by
encrypted ones as a case.

Along the same lines, the code also failed to ensure that raw
recieves could not be done on top of existing unencrypted
datasets, which causes amny problems since the new stream cannot
be decrypted.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7199
2018-02-21 12:30:11 -08:00
Tom Caputi
b1d217338a Raw receives must compress metadnode blocks
Currently, the DMU relies on ZIO layer compression to free LO
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.

This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7198
2018-02-21 12:28:52 -08:00
Tom Caputi
5121c4fb0c Remove unnecessary txg syncs from receive_object()
1b66810b introduced serveral changes which improved the reliability
of zfs sends when large dnodes were involved. However, these fixes
required adding a few calls to txg_wait_synced() in the DRR_OBJECT
handling code. Although most of them are currently necessary, this
patch allows the code to continue without waiting in some cases
where it doesn't have to.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7197
2018-02-21 12:26:51 -08:00
Tom Caputi
478b3150de Add omitted set for os->os_next_write_raw
This one line patch adds adds a set to os->os_next_write_raw
that was omitted when the code was updated in 1b66810. Without
it, the code (in some instances) could attempt to write raw
encrypted data as regular unencrypted data without the keys
being loaded, triggering an ASSERT in zio_encrypt().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7196
2018-02-21 12:24:37 -08:00
Tom Caputi
163a8c28dd ZIL claiming should not start user accounting
Currently, ZIL claiming dirties objsets which causes
dsl_pool_sync() to attempt to perform user accounting on
them. This causes problems for encrypted datasets that were
raw received before the system went offline since they
cannot perform user accounting until they have their keys
loaded. This triggers an ASSERT in zio_encrypt(). Since
encryption was added, the code now depends on the fact that
data should only be written when objsets are owned. This
patch adds a check in dmu_objset_do_userquota_updates()
to ensure that useraccounting is only done when the objsets
are actually owned for write. As part of this work, the
zfsvfs and zvol code was updated so that it no longer lies
about owning objsets readonly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6916 
Closes #7163
2018-02-20 16:27:31 -08:00
Don Brady
cbce581353 Fix coverity defects: zfs channel programs
CID 173243, 173245:  Memory - corruptions  (OVERRUN)
 Added size argument to lcompat_sprintf() to avoid use of INT_MAX

CID 173244:  Integer handling issues  (OVERFLOW_BEFORE_WIDEN)
 Added cast to uint64_t to avoid a 32 bit overflow warning

CID 173242:  Integer handling issues  (CONSTANT_EXPRESSION_RESULT)
 Conditionally removed unused luai_numisnan() floating point check

CID 173241:  Resource leaks  (RESOURCE_LEAK)
 Added missing close(fd) on error path

CID 173240:    (UNINIT)
Fixed uninitialized variable in get_special_prop()

CID 147560:  Null pointer dereferences  (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()

CID 28475:  Memory - illegal accesses  (OVERRUN)
Fixed lcompat_sprintf() to use a size paramater

CID 28418, 28422:  Error handling issues  (CHECKED_RETURN)
Added function result cast to (void) to avoid warning

CID 23935, 28411, 28412:  Memory - corruptions  (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7181
2018-02-20 11:19:42 -08:00
Tom Caputi
7b30ee6baf Project dnode should be protected by local MAC
This patch corrects a small security issue with 9c5167d1. When the
project dnode was added to the objset_phys_t, it was not included
in the local MAC for cryptographic protection, allowing an attacker
to modify this data without the consent of the key holder. This
patch does represent an on-disk format change for anyone using
project dnodes on an encrypted dataset.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7177
2018-02-20 09:41:07 -08:00
Don Brady
62d5c55313 Address objtool check failures in lua module
The use of void __attribute__((noreturn)) in kernel builds
was causing lots of warnings if CONFIG_STACK_VALIDATION
is active. For now we just remove this attribute to achieve
clean builds for the Lua module. There was no significant
increase in the time to run the full channel_program ZTS tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7173
2018-02-15 09:53:04 -08:00
George Wilson
ddc751d56b OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio
PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c
Issue #5918
Closes #7168
2018-02-14 15:30:09 -08:00
Nasf-Fan
9c5167d19f Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.

Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').

By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.

Support the following commands and functionalities:

zfs set projectquota@project
zfs set projectobjquota@project

zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project

zfs projectspace

zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused

zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused

chattr +/-P
chattr -p project_id
lsattr -p

This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>

For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by  Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
2018-02-13 14:54:54 -08:00
sanjeevbagewadi
918dbe35b5 mmp should use a fixed tag for spa_config locks
mmp_write_uberblock() and mmp_write_done() should the same tag
for spa_config_locks.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #6530 
Closes #7155
2018-02-12 11:30:38 -08:00
Andriy Gapon
1334283225 OpenZFS 8520 - lzc_rollback
8520 lzc_rollback_to should support rolling back to origin
7198 libzfs should gracefully handle EINVAL from lzc_rollback

lzc_rollback_to() should support rolling back to a clone's origin.
The current checks in zfs_ioc_rollback() would not allow that
because the origin snapshot belongs to a different filesystem.
The overly restrictive check was in introduced in 7600, but it
was not a regression as none of the existing tools provided a
way to rollback to the origin.

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8520
OpenZFS-issue: https://www.illumos.org/issues/7198
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/78a5a1a25a
Closes #7150
2018-02-09 10:27:58 -08:00
sanjeevbagewadi
cc63068e95 Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #7011 
Closes #7054
2018-02-09 10:15:53 -08:00
Chunwei Chen
f108a49236 Fix zle_decompress out of bound access
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:08:05 -08:00
Chunwei Chen
d3190c5f29 Fix zdb -c traverse stop on damaged objset root
If a corruption happens to be on a root block of an objset, zdb -c will
not correctly report the error, and it will not traverse the datasets
that come after. This is because traverse_visitbp, which does the
callback and reset error for TRAVERSE_HARD, is skipped when traversing
zil is failed in traverse_impl.

Here's example of what 'zdb -eLcc' command looks like on a pool with
damaged objset root:

== before patch:

Traversing all blocks to verify checksums ...

Error counts:

	errno  count
block traversal size 379392 != alloc 33987072 (unreachable 33607680)

	bp count:             172
	ganged count:           0
	bp logical:       1678336      avg:   9757
	bp physical:       130560      avg:    759     compression:  12.85
	bp allocated:      379392      avg:   2205     compression:   4.42
	bp deduped:             0    ref>1:      0   deduplication:   1.00
	SPA allocated:   33987072     used:  0.80%

	additional, non-pointer bps of type 0:         71
	Dittoed blocks on same vdev: 101

== after patch:

Traversing all blocks to verify checksums ...

zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0>  -- skipping

Error counts:

	errno  count
	   52  1
block traversal size 33963520 != alloc 33987072 (unreachable 23552)

	bp count:             447
	ganged count:           0
	bp logical:      36093440      avg:  80745
	bp physical:     33699840      avg:  75391     compression:   1.07
	bp allocated:    33963520      avg:  75981     compression:   1.06
	bp deduped:             0    ref>1:      0   deduplication:   1.00
	SPA allocated:   33987072     used:  0.80%

	additional, non-pointer bps of type 0:         76
	Dittoed blocks on same vdev: 115

==

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:05:25 -08:00
Brian Behlendorf
18f57327e0 Linux 4.16 compat: inode_set_iversion()
A new interface was added to manipulate the version field of an
inode.  Add a inode_set_iversion() wrapper for older kernels and
use the new interface when available.

The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos
5b72a38d68 OpenZFS 8677 - Open-Context Channel Programs
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.

=== What is implemented?

This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
  to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
  the context.
- Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

=== How do we handle zfs.sync functions in open context?

When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...

arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg

if we run it in open context, we will get back the following error:

Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
            [C]: in function 'destroy'
            [string "channel program"]:3: in main chunk

=== What about testing?

We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.

OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15
Closes #6558
2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos
8d103d8856 OpenZFS 8604 - Simplify snapshots unmounting code
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).

The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to amount
it.

With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have too. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.

A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/

OpenZFS-issue: https://www.illumos.org/issues/8604
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb
2018-02-08 15:29:44 -08:00
Don Brady
fc5d4b6737 Increase code coverage for Lua libraries
Add test coverage for lua libraries
Remove dead code in Lua implementation

Signed-off-by: Don Brady <don.brady@delphix.com>
2018-02-08 15:29:38 -08:00
Chris Williamson
234c91c508 OpenZFS 8600 - ZFS channel programs - snapshot
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.

OpenZFS-issue: https://www.illumos.org/issues/8600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b
2018-02-08 15:29:24 -08:00
Brad Lewis
af07368986 OpenZFS 8592 - ZFS channel programs - rollback
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to perform a rollback.

OpenZFS-issue: https://www.illumos.org/issues/8592
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6
2018-02-08 15:29:14 -08:00
Chris Williamson
475eca4908 OpenZFS 8605 - zfs channel programs fix zfs.exists
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

zfs.exists() in channel programs doesn't return any result, and should
have a man page entry. This patch corrects zfs.exists so that it
returns a value indicating if the dataset exists or not. It also adds
documentation about it in the man page.

OpenZFS-issue: https://www.illumos.org/issues/8605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1e85e111
2018-02-08 15:28:52 -08:00
Chris Williamson
d99a015343 OpenZFS 7431 - ZFS Channel Programs
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533

Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
  Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
2018-02-08 15:28:18 -08:00
WHR
8824a7f133 OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable
Authored by: WHR <msl0000023508@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8966
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc
Closes #7141
2018-02-08 10:33:45 -08:00
Richard Elling
6b810d04bd Remove deprecated zfs_arc_p_aggressive_disable
zfs_arc_p_aggressive_disable is no more. This PR removes docs
and module parameters for zfs_arc_p_aggressive_disable.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Elling <Richard.Elling@RichardElling.com>
Closes #7135
2018-02-07 11:54:20 -08:00
Brian Behlendorf
5461eefe50
Fix cstyle warnings
This patch contains no functional changes.  It is solely intended
to resolve cstyle warnings in order to facilitate moving the spl
source code in to the zfs repository.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #681
2018-02-07 11:49:38 -08:00
Tom Caputi
2b84817f66 Adjust ARC prefetch tunables to match docs
Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms
and zfs_arc_min_prescient_prefetch_ms) which are documented
to be specified in milliseconds. However, the code actually
uses the values as though they were in seconds. This patch
adjusts the code to match the names and documentation of the
tunables.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7126
2018-02-05 16:57:53 -08:00
wli5
f6b58faaa6 Bug fix in qat_compress.c for vmalloc addr check
Remove the unused vmalloc address check, and function mem_to_page
will handle the non-vmalloc address when map it to a physical
address.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #7125
2018-02-05 10:26:27 -08:00
Brian Behlendorf
0d23f5e2e4
Fix hash_lock / keystore.sk_dk_lock lock inversion
The keystore.sk_dk_lock should not be held while performing I/O.
Drop the lock when reading from disk and update the code so
they the first successful caller adds the key.

Improve error handling in spa_keystore_create_mapping_impl().

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: RageLtMan <rageltman@sempervictus>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7112 
Closes #7115
2018-02-04 14:07:13 -08:00
Tom Caputi
1b66810bad Change os->os_next_write_raw to work per txg
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.

This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
2018-02-02 11:44:53 -08:00
Tom Caputi
047116ac76 Raw sends must be able to decrease nlevels
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6821
Closes #6864
2018-02-02 11:43:11 -08:00
Tom Caputi
d53bd7f524 Fix recovery import (-F) with encrypted pool
When performing zil_claim() at pool import time, it is
important that encrypted datasets set os_next_write_raw
before writing to the zil_header_t. This prevents the code
from attempting to re-authenticate the objset_phys_t when
it writes it out, which is unnecessary because the
zil_header_t is not protected by either objset MAC and
impossible since the keys aren't loaded yet. Unfortunately,
one of the code paths did not set this flag, which causes
failed ASSERTs during 'zpool import -F'. This patch corrects
this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
Closes #6916
2018-02-02 11:39:36 -08:00
Tom Caputi
ae76f45cda Encryption Stability and On-Disk Format Fixes
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.

Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.

This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.

This patch also contains minor bug fixes and cleanups.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6845
Closes #6864
Closes #7052
2018-02-02 11:37:16 -08:00
Tom Caputi
a73c94934f Change movaps to movups in AES-NI code
Currently, the ICP contains accelerated assembly code to be
used specifically on CPUs with AES-NI enabled. This code
makes heavy use of the movaps instruction which assumes that
it will be provided aes keys that are 16 byte aligned. This
assumption seems to hold on Illumos, but on Linux some kernel
options such as 'slub_debug=P' will violate it. This patch
changes all instances of this instruction to movups which is
the same except that it can handle unaligned memory.

This patch also adds a few flags which were accidentally never
given to the assembly compiler, resulting in objtool warnings.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7065 
Closes #7108
2018-01-31 15:17:56 -08:00
Brian Behlendorf
f90a30ad1b
Fix txg_sync_thread hang in scan_exec_io()
When scn->scn_maxinflight_bytes has not been initialized it's
possible to hang on the condition variable in scan_exec_io().
This issue was uncovered by ztest and is only possible when
deduplication is enabled through the following call path.

  txg_sync_thread()
    spa_sync()
      ddt_sync_table()
        ddt_sync_entry()
          dsl_scan_ddt_entry()
            dsl_scan_scrub_cb()
              dsl_scan_enqueuei()
                scan_exec_io()
                  cv_wait()

Resolve the issue by always initializing scn_maxinflight_bytes
to a reasonable minimum value.  This value will be recalculated
in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit
and the addition/removal of vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7098
2018-01-31 09:33:33 -08:00
LOLi
63f88c12b4 Fix style issues in man pages and commands help
* Remove 'zfs snap' from zfs help message (OpenZFS sync)
* Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot'
* Enforce 80 columns limit in help messages
* Remove zfs_disable_dup_eviction from zfs-module-parameters(5)
* Expose zfs_scan_max_ext_gap as a kernel module parameter.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7087
2018-01-29 15:05:03 -08:00
Giuseppe Di Natale
5e021f56d3 Add dbuf hash and dbuf cache kstats
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.

Correct format of dbuf kstat file.

Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.

Introduce field filtering in the dbufstat python script.

Introduce a no header option to the dbufstat python script.

Introduce a test case to test basic mru->mfu list movement
in the ARC.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
2018-01-29 10:24:52 -08:00
Prakash Surya
0735ecb334 OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue
PROBLEM
=======

When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.

If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.

If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:

    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);

In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.

SOLUTION
========

This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.

Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issued`, leading the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.

With this change, `zil_lwb_write_issued` can acheive the semtantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.

Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`

OpenZFS-issue: https://www.illumos.org/issues/8997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9
Closes #7084
2018-01-26 20:19:46 -08:00
Brian Behlendorf
8fb1ede146 Extend deadman logic
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems.  The proposed changes include:

* Added a new `zfs_deadman_failmode` module option which is
  used to dynamically control the behavior of the deadman.  It's
  loosely modeled after, but independant from, the pool failmode
  property.  It can be set to wait, continue, or panic.

    * wait     - Wait for the "hung" I/O (default)
    * continue - Attempt to recover from a "hung" I/O
    * panic    - Panic the system

* Added a new `zfs_deadman_ziotime_ms` module option which is
  analogous to `zfs_deadman_synctime_ms` except instead of
  applying to a pool TXG sync it applies to zio_wait().  A
  default value of 300s is used to define a "hung" zio.

* The ztest deadman thread has been re-enabled by default,
  aligned with the upstream OpenZFS code, and then extended
  to terminate the process when it takes significantly longer
  to complete than expected.

* The -G option was added to ztest to print the internal debug
  log when a fatal error is encountered.  This same option was
  previously added to zdb in commit fa603f82.  Update zloop.sh
  to unconditionally pass -G to obtain additional debugging.

* The FM_EREPORT_ZFS_DELAY event which was previously posted
  when the deadman detect a "hung" pool has been replaced by
  a new dedicated FM_EREPORT_ZFS_DEADMAN event.

* The proposed recovery logic attempts to restart a "hung"
  zio by calling zio_interrupt() on any outstanding leaf zios.
  We may want to further restrict this to zios in either the
  ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
  Calling zio_interrupt() is expected to only be useful for
  cases when an IO has been submitted to the physical device
  but for some reasonable the completion callback hasn't been
  called by the lower layers.  This shouldn't be possible but
  has been observed and may be caused by kernel/driver bugs.

* The 'zfs_deadman_synctime_ms' default value was reduced from
  1000s to 600s.

* Depending on how ztest fails there may be no cache file to
  move.  This should not be considered fatal, collect the logs
  which are available and carry on.

* Add deadman test cases for spa_deadman() and zio_wait().

* Increase default zfs_deadman_checktime_ms to 60s.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:40:38 -08:00
Andriy Gapon
1b18c6d791 OpenZFS 8731 - ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks
Authored by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8731
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4c08500788
Closes #7079
2018-01-25 10:02:11 -08:00
Giuseppe Di Natale
cf232b53d5 Revert "Remove wrong ASSERT in annotate_ecksum"
This reverts commit 093911f194.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7079
2018-01-25 10:01:02 -08:00
Brian Behlendorf
23602fdb39
Add cv_timedwait_io()
Add missing helper function cv_timedwait_io(), it should be used
when waiting on IO with a specified timeout.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #674
2018-01-24 11:33:47 -08:00
Alexander Motin
916384729e OpenZFS 8835 - Speculative prefetch in ZFS not working for misaligned reads
In case of misaligned I/O sequential requests are not detected as such
due to overlaps in logical block sequence:

    dmu_zfetch(fffff80198dd0ae0, 27347, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27355, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27363, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27371, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27379, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27387, 9, 1)

This patch makes single block overlap to be counted as a stream hit,
improving performance up to several times.

Authored by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8835
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aab6dd482a
Closes #7062
2018-01-19 09:31:29 -08:00
Brian Behlendorf
31864e3d8c
OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL
usr/src/uts/common/sys/fs/zfs.h
	Change ZPROP_INVAL and ZPROP_CONT from macros to enum values.  Clang
	and GCC both prefer to use unsigned ints to store enums.  That was
	causing tautological comparison warnings (and likely eliminating
	error handling code at compile time) whenever a zfs_prop_t or
	zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT.  Making the
	error flags be explicity enum values forces the enum types to be
	signed.

	ZPROP_INVAL was also compared against two different enum types.  I
	had to change its name to ZPOOL_PROP_INVAL whenever its compared to
	a zpool_prop_t.  There are still some places where ZPROP_INVAL or
	ZPROP_CONT is compared to a plain int, in code that doesn't know
	whether the int is storing a zfs_prop_t or a zpool_prop_t.

usr/src/uts/common/fs/zfs/spa.c
	s/ZPROP_INVAL/ZPOOL_PROP_INVAL/

Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74
Closes #7061
2018-01-19 09:22:37 -08:00
John L. Hammond
51d1b58ef3 Emit an error message before MMP suspends pool
In mmp_thread(), emit an MMP specific error message before calling
zio_suspend() so that the administrator will understand why the pool
is being suspended.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John L. Hammond <john.hammond@intel.com>
Closes #7048
2018-01-17 12:24:42 -08:00
Sean Eric Fagan
43cb30b3ce OpenZFS 8959 - Add notifications when a scrub is paused or resumed
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.

OpenZFS-issue: https://www.illumos.org/issues/8959
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea
Closes #7049
2018-01-17 10:31:00 -08:00
Andriy Gapon
6a2185660d OpenZFS 8930 - zfs_zinactive: do not remove the node if the filesystem is readonly
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8930
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/93c618e0f4
Closes #7029
2018-01-11 13:50:08 -08:00
Brian Behlendorf
fed90353d7
Support -fsanitize=address with --enable-asan
When --enable-asan is provided to configure then build all user
space components with fsanitize=address.  For kernel support
use the Linux KASAN feature instead.

https://github.com/google/sanitizers/wiki/AddressSanitizer

When using gcc version 4.8 any test case which intentionally
generates a core dump will fail when using --enable-asan.
The default behavior is to disable core dumps and only newer
versions allow this behavior to be controled at run time with
the ASAN_OPTIONS environment variable.

Additionally, this patch includes some build system cleanup.

* Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS,
  and AM_LDFLAGS.  Any additional flags should be added on a
  per-Makefile basic.  The --enable-debug and --enable-asan
  options apply to all user space binaries and libraries.

* Compiler checks consolidated in always-compiler-options.m4
  and renamed for consistency.

* -fstack-check compiler flag was removed, this functionality
  is provided by asan when configured with --enable-asan.

* Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and
  DEBUG_LDFLAGS.

* Moved default kernel build flags in to module/Makefile.in and
  split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS.  These
  flags are set with the standard ccflags-y kbuild mechanism.

* -Wframe-larger-than checks applied only to binaries or
  libraries which include source files which are built in
  both user space and kernel space.  This restriction is
  relaxed for user space only utilities.

* -Wno-unused-but-set-variable applied only to libzfs and
  libzpool.  The remaining warnings are the result of an
  ASSERT using a variable when is always declared.

* -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped
  because they are Solaris specific and thus not needed.

* Ensure $GDB is defined as gdb by default in zloop.sh.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7027
2018-01-10 10:49:27 -08:00
Alex Zhuravlev
3910184d9e Use zap_count instead of cached z_size for unlink
As a performance optimization Lustre does not strictly update
the SA_ZPL_SIZE when adding/removing from non-directory entries.
This results in entries which cannot be removed through the ZPL
layer even though the ZAP is empty and safe to remove.

Resolve this issue by checking the zap_count() directly instead
on relying on the cached SA_ZPL_SIZE.  Micro-benchmarks show no
significant performance impact due to the additional overhead
of using zap_count().

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7019
2018-01-09 16:16:07 -08:00
DHE
460f239e69 Fix -fsanitize=address memory leak
kmem_alloc(0, ...) in userspace returns a leakable pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #6941
2018-01-09 12:26:25 -08:00
Brian Behlendorf
0873bb6337
Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6171 
Closes #6852 
Closes #6989
2018-01-08 09:52:36 -08:00
Prakash Surya
2fe61a7ecc OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-28 10:18:04 -08:00
lidongyang
823d48bfb1 Call commit callbacks from the tail of the list
Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.

This patch corrects this, it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes #6986
2017-12-22 10:19:51 -08:00
Tony Hutter
c9821f1ccc Linux 4.15 compat: timer updates
Use timer_setup() macro and new timeout function definition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #670
Closes #671
2017-12-21 10:56:32 -08:00
Tom Caputi
a8b2e30685 Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.

This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921 
Closes #6926
2017-12-21 09:13:06 -08:00
Brian Behlendorf
bbffb59efc
Fix multihost stale cache file import
When the multihost property is enabled it should be impossible to
import an active pool even using the force (-f) option.  This patch
prevents a forced import from succeeding when importing with a
stale cache file.

The root cause of the problem is that the kernel modules trusted
the hostid provided in configuration.  This is always correct when
the configuration is generated by scanning for the pool.  However,
when using an existing cache file the hostid could be stale which
would result in the activity check being skipped.

Resolve the issue by always using the hostid read from the label
configuration where the best uberblock was found.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6933 
Closes #6971
2017-12-18 10:28:27 -08:00
Brian Behlendorf
c28a67733c
Suppress incorrect objtool warnings
Suppress incorrect warnings from versions of objtool which are not
aware of x86 EVEX prefix instructions used for AVX512.

  module/zfs/vdev_raidz_math_avx512bw.o: warning:
  objtool: <func+offset>: can't find jump dest instruction at .text

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6928
2017-12-07 10:28:50 -08:00
Prakash Surya
1b2b0acab5 OpenZFS 8603 - rename zilog's "zl_writer_lock" to "zl_issuer_lock"
This is a purely cosmetic change. The zilog's "zl_writer_lock" field is
being renamed to "zl_issuer_lock" to try and make the code easier to
understand; no other changes are made.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2daf06546b
Closes #6927
2017-12-06 11:38:10 -08:00
Prakash Surya
1ce23dcaff OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there's outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently oustanding ZIL
    blocks to complete. The blocks written for each "writer thread" is
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transaction in the queue
    at the time that the batch starts processing the queue; which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularites.

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new allgorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued, while there's already
    oustanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.

Porting Notes
=============

Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  * A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  * A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  * The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  * Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  * New tracepoints had to be added for the following 3 new probes:
      * zil__process__commit__itx
      * zil__process__normal__itx
      * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 09:39:16 -08:00
Brian Behlendorf
7b3407003f
Fix NFS sticky bit permission denied error
When zfs_sticky_remove_access() was originally adapted for Linux
a typo was made which altered the intended behavior.  As described
in the block comment, the intended behavior is that permission
should be granted when the entry is a regular file and you have
write access.  That is, S_ISREG should have been used instead of
S_ISDIR.

Restricting permission to regular files made good sense for older
systems where setting the bit on executable files would instruct
the system to save the program's text segment on the swap device.

On modern systems this behavior has been replaced by the sticky
bit acting as a restricted deletion flag and the plain file
restriction has been relaxed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6889 
Closes #6910
2017-12-04 11:55:57 -08:00
Brian Behlendorf
72841b9fd9
Preserve itx alloc size for zio_data_buf_free()
Using zio_data_buf_alloc() to allocate the itx's may be unsafe
because the itx->itx_lr.lrc_reclen field is not constant from
allocation to free.  Using a different itx->itx_lr.lrc_reclen
size in zio_data_buf_free() can result in the allocation being
returned to the wrong kmem cache.

This issue can be avoided entirely by storing the allocation size
in itx->itx_size and using that for zio_data_buf_free().

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6912
2017-12-04 11:44:39 -08:00
Tom Caputi
d4677269f2 Unbreak the scan status ABI
When d4a72f23 was merged, pss_pass_issued was incorrectly
added to the middle of the pool_scan_stat_t structure
instead of the end. This patch simply corrects this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6909
2017-11-30 09:40:13 -08:00
LOLi
ed15d54481 Fix 'zfs get {user|group}objused@' functionality
Fix a regression accidentally introduced in 1b81ab4 that prevents
'zfs get {user|group}objused@' from correctly reporting the requested
value.

Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify
this functionality.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6908
2017-11-29 11:59:22 -08:00
Mark Wright
56d8d8ace4 Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT
Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14
built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

module/nvpair/nvpair.c:2810:2:error:
positional initialization of field in ?struct? declared with
'designated_init' attribute [-Werror=designated-init]
  nvs_native_nvlist,
  ^~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Wright <gienah@gentoo.org>
Closes #5390 
Closes #6903
2017-11-28 17:33:48 -06:00
Brian Behlendorf
94183a9d8a
Update for cppcheck v1.80
Resolve new warnings and errors from cppcheck v1.80.

* [lib/libshare/libshare.c:543]: (warning)
  Possible null pointer dereference: protocol
* [lib/libzfs/libzfs_dataset.c:2323]: (warning)
  Possible null pointer dereference: srctype
* [lib/libzfs/libzfs_import.c:318]: (error)
  Uninitialized variable: link
* [module/zfs/abd.c:353]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:353]: (error) Uninitialized variable: i
* [module/zfs/abd.c:385]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:385]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:763]: (error) Uninitialized variable: i
* [module/zfs/abd.c:763]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page
* [module/zfs/zpl_xattr.c:342]: (warning)
   Possible null pointer dereference: value
* [module/zfs/zvol.c:208]: (error) Uninitialized variable: p

Convert the following suppression to inline.

* [module/zfs/zfs_vnops.c:840]: (error)
  Possible null pointer dereference: aiov

Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since
these macro's will never be defined until this functionality
is implemented.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6879
2017-11-18 14:08:00 -08:00
DeHackEd
da5d4697a8 Fix ARC pointer overrun
Only access the `b_crypt_hdr` field of an ARC header if the content
is encrypted.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6877
2017-11-17 15:11:39 -08:00
Tom Caputi
d4a72f2386 Sequential scrub and resilvers
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #3625 
Closes #6256
2017-11-15 17:27:01 -08:00
Brian Behlendorf
ed19bccfb6
Linux 4.14 compat: vfs_read & vfs_write
The kernel_read & kernel_write functions have always wrapped the
vfs_read & vfs_write functions respectively.  However, they could
not be used by vn_rdwr() since the offset wasn't passed as a
pointer.  This prevented us from being able to properly update
the file offset.

Linux 4.14 unexported vfs_read & vfs_write but also changed the
signature of kernel_read & kernel_write to provide the needed
functionality.  Use these updated functions when available.

Reviewed-by: Pritam Baral <pritam@pritambaral.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #656 
Closes #667
2017-11-15 17:19:23 -08:00
Brian Behlendorf
454365bbaa
Fix dirty check in dmu_offset_next()
The correct way to determine if a dnode is dirty is to check
if any of the dn->dn_dirty_link's are active.  Relying solely
on the dn->dn_dirtyctx can result in the dnode being mistakenly
reported as clean.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3125 
Closes #6867
2017-11-15 10:19:32 -08:00
LOLi
99834d1950 Fix truncate(2) mtime and ctime handling
On Linux, ftruncate(2) always changes the file timestamps, even if the
file size is not changed. However, in case of a successfull
truncate(2), the timestamps are updated only if the file size changes.
This translates to the VFS calling the ZFS Posix Layer "setattr"
function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally
set on the iattr mask only when doing a ftruncate(2), while the
truncate(2) is left to the filesystem implementation to be dealt with.

This behaviour is consistent with POSIX:2004/SUSv3 specifications
where there's no explicit requirement for file size changes to update
the timestamps only for ftruncate(2):

http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html

This has been later updated in POSIX:2008/SUSv4 where, for both
truncate(2)/ftruncate(2), there's no mention of this size change
requirement:

http://austingroupbugs.net/view.php?id=489
http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Unfortunately the Linux VFS is still calling into the ZPL without
ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by
explicitly updating the timestamps when detecting the ATTR_SIZE bit,
which is always set in do_truncate(), on the iattr mask.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6811 
Closes #6819
2017-11-13 09:24:26 -08:00
benrubson
7c351e31d5 OpenZFS 7531 - Assign correct flags to prefetched buffers
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: abraunegg <alex.braunegg@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb
2017-11-11 20:24:34 -08:00
Arkadiusz Bubała
c0daec32f8 Long hold the dataset during upgrade
If the receive or rollback is performed while filesystem is upgrading
the objset may be evicted in `dsl_dataset_clone_swap_sync_impl`. This
will lead to NULL pointer dereference when upgrade tries to access
evicted objset.

This commit adds long hold of dataset during whole upgrade process.
The receive and rollback will return an EBUSY error until the
upgrade is not finished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #5295 
Closes #6837
2017-11-10 13:37:10 -08:00
Tom Caputi
62df1bc813 Fix encryption root hierarchy issue
After doing a recursive raw receive, zfs userspace performs
a final pass to adjust the encryption root hierarchy as
needed. Unfortunately, the FORCE_INHERIT ioctl had a bug
which caused the encryption root to always be assigned to
the direct parent instead of the inheriting parent. This
patch simply fixes this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6847 
Closes #6848
2017-11-08 15:25:30 -08:00
Tim Chase
71a24c3c52 Handle compressed buffers in __dbuf_hold_impl()
In __dbuf_hold_impl(), if a buffer is currently syncing and is still
referenced from db_data, a copy is made in case it is dirtied again in
the txg.  Previously, the buffer for the copy was simply allocated with
arc_alloc_buf() which doesn't handle compressed or encrypted buffers
(which are a special case of a compressed buffer).  The result was
typically an invalid memory access because the newly-allocated buffer
was of the uncompressed size.

This commit fixes the problem by handling the 2 compressed cases,
encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and
arc_alloc_compressed_buf().

Although using the proper allocation functions fixes the invalid memory
access by allocating a buffer of the compressed size, another unrelated
issue made it impossible to properly detect compressed buffers in the
first place.  The header's compression flag was set to ZIO_COMPRESS_OFF
in arc_write() when it was possible that an attached buffer was actually
compressed.  This commit adds logic to only set ZIO_COMPRESS_OFF in
the non-ZIO_RAW case which wil handle both cases of compressed buffers
(encrypted or unencrypted).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5742 
Closes #6797
2017-11-08 13:32:15 -08:00
Brian Behlendorf
d8fdfc2d65
OpenZFS 8607 - variable set but not used
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Toomas Soome <tsoome@me.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8607
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b852c2f5
Closes #6842
2017-11-08 09:09:45 -08:00
wli5
a3df7fa79d Bug fix in qat_compress.c when compressed size is < 4KB
When the 128KB block is compressed to less than 4KB, the pointer
to the Footer is not in the end of the compressed buffer, that's
because the Header offset was added twice for this case. So there
is a gap between the Footer and the compressed buffer.
1. Always compute the Footer pointer address from the start of the
last page.
2. Remove the un-used workaroud code which has been verified fixed
with the latest driver and this fix.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6827
2017-11-07 14:51:30 -08:00
Don Brady
31df97cdab Build regression in c89 cleanups
Fixed build regression in non-debug builds from recent cleanups of
c89 workarounds.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6832
2017-11-07 10:42:15 -08:00
Don Brady
1c27024e22 Undo c89 workarounds to match with upstream
With PR 5756 the zfs module now supports c99 and the
remaining past c89 workarounds can be undone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6816
2017-11-04 13:25:13 -07:00
James Cowgill
35a44fcb8d Remove all spin_is_locked calls
On systems with CONFIG_SMP turned off, spin_is_locked always returns
false causing these assertions to fail. Remove them as suggested in
zfsonlinux/zfs#6558.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Cowgill <james.cowgill@mips.com>
Closes #665
2017-10-30 11:16:56 -07:00
Brian Behlendorf
8be3688999
Remove vn_rename and vn_remove
Both vn_rename and vn_remove have been historically problematic
to implement reliably.  Rather than fixing them yet again they
are being removed.

Reviewed-by: Arkadiusz Bubala <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648 
Closes #661
2017-10-27 15:49:14 -07:00
Brian Behlendorf
867959b588
OpenZFS 8081 - Compiler warnings in zdb
Fix compiler warnings in zdb.  With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.

usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
	Fix numerous warnings, including:
	* const-correctness
	* shadowing global definitions
	* signed vs unsigned comparisons
	* missing prototypes, or missing static declarations
	* unused variables and functions
	* Unreadable array initializations
	* Missing struct initializers

usr/src/cmd/zdb/zdb.h
	Add a header file to declare common symbols

usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
	Add a function prototype for zk_thread_create, and ensure that every
	callback supplied to this function actually matches the prototype.

usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
	Add a function prototype for zil_replay_func_t, and ensure that
	every function of this type actually matches the prototype.

usr/src/uts/common/fs/zfs/sys/refcount.h
	Change FTAG so it discards any constness of __func__, necessary
	since existing APIs expect it passed as void *.

Porting Notes:
- Many of these fixes have already been applied to Linux.  For
  consistency the OpenZFS version of a change was applied if the
  warning was addressed in an equivalent but different fashion.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8a
Closes #6787
2017-10-27 12:46:35 -07:00
LOLi
ee45fbd894 ZFS send fails to dump objects larger than 128PiB
When dumping objects larger than 128PiB it's possible for do_dump() to
miscalculate the FREE_RECORD offset due to an integer overflow
condition: this prevents the receiving end from correctly restoring
the dumped object.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6760
2017-10-26 16:58:38 -07:00
Brian Behlendorf
a032ac4b38 OpenZFS 8558, 8602 - lwp_create() returns EAGAIN
8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems

On a system with more than 80K ZFS filesystems, we've seen cases
where lwp_create() will start to fail by returning EAGAIN. The
problem being, for each of those 80K ZFS filesystems, a taskq will
be created for each dataset as part of the ZIL for each dataset.

Porting Notes:
- The new nomem taskq kstat was dropped.
- Added module options and documentation for new tunings
  zfs_zil_clean_taskq_nthr_pct, zfs_zil_clean_taskq_minalloc,
  zfs_zil_clean_taskq_maxalloc, and zfs_sync_taskq_batch_pct.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8558
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/216d772

8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8602
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2bcb545
Closes #6779
2017-10-26 12:57:53 -07:00
Arkadiusz Bubała
d3f2cd7e3b Added no_scrub_restart flag to zpool reopen
Added -n flag to zpool reopen that allows a running scrub
operation to continue if there is a device with Dirty Time Log.

By default if a component device has a DTL and zpool reopen
is executed all running scan operations will be restarted.

Added functional tests for `zpool reopen`

Tests covers following scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6076 
Closes #6746
2017-10-26 12:26:09 -07:00
Brian Behlendorf
d5e024cba2 Emit history events for 'zpool create'
History commands and events were being suppressed for the
'zpool create' command since the history object did not
yet exist.  Create the object earlier so this history
doesn't get lost.

Split the pool_destroy event in to pool_destroy and
pool_export so they may be distinguished.

Updated events_001_pos and events_002_pos test cases.  They
now check for the expected history events and were reworked
to be more reliable.

Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6712 
Closes #6486
2017-10-23 09:45:59 -07:00
wli5
1cfdb0e6e4 Support integration with new QAT products
Support integration with new QAT products: Intel(R) C62x Chipset,
or Atom(R) C3000 Processor Product Family SoC:
1. Detect new file name in auto-conf.
2. Change MAX_INSTANCES to 48.
3. Change "num_inst" to U16 to clean a build warning.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6767
2017-10-20 11:11:25 -07:00
Brian Behlendorf
bbf1ad67cd Remove vn_rename and vn_remove dependency
The only place vn_rename and vn_remove are used is when writing
out an updated pool configuration file.  By truncating the file
instead of renaming and removing it we can avoid having to implement
these interfaces entirely.  Functionally an empty cache file is
treated the same as a missing cache file.  This is particularly
advantageous because the Linux kernel has never provided a way
to reliably implement vn_rename and vn_remove.

The cachefile_004_pos.ksh test case was updated to understand
that an empty cache file is the same as a missing one.

The zfs-import-* systemd service files were not updated to use
ConditionFileNotEmpty in place of ConditionPathExists.  This
means that after exporting all pools and rebooting new pools
will not the scanned for on the next boot.  This small change
should not impact normal usage since pools are not exported
as part of a normal shutdown.

Documentation was updated accordingly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#648 
Closes #6753
2017-10-19 10:06:55 -07:00
Tom Caputi
35df0bb556 Fix ASSERT in dmu_free_long_object_raw()
This small patch fixes an issue where dmu_free_long_object_raw()
calls dnode_hold() after freeing the dnode a line above.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6766
2017-10-18 10:08:36 -07:00
Brian Behlendorf
21a932b83c Post-Encryption Followup
This PR includes fixes for bugs and documentation issues found 
after the encryption patch was merged and general code improvements 
for long-term maintainability.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6526
Closes #6639
Closes #6703
Cloese #6706
Closes #6714
Closes #6595
2017-10-13 10:02:39 -07:00
Tom Caputi
9bae371ce6 Fix for #6714
This 2 line patch fixes a possible integer overflow reported by grsec.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:59:42 -04:00
Tom Caputi
2637dda8f8 Fix for #6706
This patch resolves an issue where raw sends would fail to send
encryption parameters if the wrapping key was unloaded and reloaded
before the data was sent and the dataset wass not an encryption root.
The code attempted to lookup the values from the wrapping key which
was not being initialized upon reload. This change forces the code to
lookup the correct value from the encryption root's DSL Crypto Key.
Unfortunately, this issue led to the on-disk DSL Crypto Key for some
non-encryption root datasets being left with zeroed out encryption
parameters. However, this should not present a problem since these
values are never looked at and are overrwritten upon changing keys.

This patch also fixes an issue where raw, resumable sends were not
being cleaned up appropriately if an invalid DSL Crypto Key was
received.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:58:39 -04:00
Tom Caputi
b135b9f11a Fix for #6703
This patch resolves an issue where spa_keystore_change_key_sync_impl()
incorrectly recursed into clone DSL Directories while recursively
rewrapping encryption keys. Clones share keys with their origins, so
this logic was incorrect.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:57:22 -04:00
Tom Caputi
440a3eb939 Fixes for #6639
Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:

* arc_read_done() has the ability to chain several requests for
  the same block of data via the arc_callback_t struct. In these
  cases, the ARC would only use the first request's dsobj from
  the bookmark to decrypt the data. This is problematic because
  the first request might be a prefetch zio which is able to
  handle the key not being loaded, while the second might use a
  different key that it is sure will work. The fix here is to
  pass the dsobj with each individual arc_callback_t so that each
  request can attempt to decrypt the data separately.

* DRR_FREE and DRR_FREEOBJECT records in a send file were not
  having their transactions properly tagged as raw during raw
  sends, which caused a panic when the dbuf code attempted to
  decrypt these blocks.

* traverse_prefetch_metadata() did not properly set
  ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.

* Added a few asserts and code cleanups to ensure these issues
  are more detectable in the future.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:55:50 -04:00
Tom Caputi
4807c0badb Encryption patch follow-up
* PBKDF2 implementation changed to OpenSSL implementation.

* HKDF implementation moved to its own file and tests
  added to ensure correctness.

* Removed libzfs's now unnecessary dependency on libzpool
  and libicp.

* Ztest can now create and test encrypted datasets. This is
  currently disabled until issue #6526 is resolved, but
  otherwise functions as advertised.

* Several small bug fixes discovered after enabling ztest
  to run on encrypted datasets.

* Fixed coverity defects added by the encryption patch.

* Updated man pages for encrypted send / receive behavior.

* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.

* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:54:48 -04:00
Tom Caputi
94d49e8f9b Relax ASSERT for #6526
This patch resolves a minor issue where an ASSERT in
metaslab_passivate() that only applies to non weight-based
metaslabs was erroneously applied to all metaslabs.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:53:37 -04:00
KireinaHoro
a7ec8c47e2
SPARC optimizations for Encode()
Normally a SPARC processor runs in big endian mode. Save the extra labor
needed for little endian machines when the target is a big endian one
(sparc).

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:36:16 +08:00
KireinaHoro
46d4fe880e
SPARC optimizations for SHA1Transform()
Passing arguments explicitly into SHA1Transform() increases the number of
registers abailable to the compiler, hence leaving more local and out registers
available. The missing symbol of sha1_consts[], which prevents compiling on
SPARC, is added back, which speeds up the process of utilizing the relative
constants.
This should fix #6738.

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:36:11 +08:00
Tobin Harding
523d5ce0f4 Fix coverity defects: CID 147474
CID 147474: Logically dead code (DEADCODE)

Remove ternary operator and return `error` directly.

Currently return value is derived from a ternary operator. The
conditional is always true. The ternary operator is therefore
redundant i.e dead code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6723
2017-10-10 16:41:47 -07:00
Fabian Grünbichler
829e95c4dc Skip FREEOBJECTS for objects which can't exist
When sending an incremental stream based on a snapshot, the receiving
side must have the same base snapshot.  Thus we do not need to send
FREEOBJECTS records for any objects past the maximum one which exists
locally.

This allows us to send incremental streams (again) to older ZFS
implementations (e.g. ZoL < 0.7) which actually try to free all objects
in a FREEOBJECTS record, instead of bailing out early.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #5699
Closes #6507
Closes #6616
2017-10-10 15:35:49 -07:00
Fabian Grünbichler
48fbb9ddbf Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream.  Free them accordingly, as if a
freeobjects record for them had been included in the stream.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #5699
Closes #6507
Closes #6616
2017-10-10 15:30:51 -07:00
Brian Behlendorf
70f02287f8 Fix ARC behavior on 32-bit systems
With the addition of the ABD changes consumption of the virtual
address space has been greatly reduced.  This exposed an issue on
CONFIG_HIGHMEM systems where free memory was being calculated
incorrectly.  Functionally this didn't cause any major problems
prior to ABD because a lack of available virtual address space
was used as an indicator of low memory.

This patch makes the following changes to address the issue and
in the process realigns the code further with OpenZFS.  There
are no substantive changes in behavior for 64-bit systems.

* Added CONFIG_HIGHMEM case to the arc_all_memory() and
  arc_free_memory() functions to only consider low memory pages
  on CONFIG_HIGHMEM systems.

* The arc_free_memory() function was updated to return bytes
  instead of pages to be consistent with the other helper
  functions.  In user space we make up some reasonable values
  since currently only testing is performed in this context.

* Adds three new values to the arcstats kstat to provide visibility
  in to the ARC's assessment of the memory situation:
  memory_all_bytes, memory_free_bytes, and memory_available_bytes.

* Added kmem_reap() call to arc_available_memory() for 32-bit
  builds to realign code with OpenZFS.

* Reduced size of test file in /async_destroy_001_pos.ksh to
  speed up test case.  Multiple txgs are still required.

* Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos
  to TEST_BASE_DIR location to speed up test cases.

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5352
Closes #6734
2017-10-10 15:19:19 -07:00
Olaf Faaland
4b393c50ae Make file headers conform to ZFS style standard
No semantic changes.

Change
 /************\
and
 \************/

to

 /*
and
  */

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-10-09 14:27:27 -07:00
Brian Behlendorf
57f4ef2e81 Fix abdstats kstat on 32-bit systems
When decrementing the struct_size and scatter_chunk_waste kstats
the value needs to be cast to an int on 32-bit systems.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6721
2017-10-06 11:23:12 -07:00
Tobin Harding
d95a59805f Remove unnecessary equality check
Currently `if` statement includes an assignment (from a function return
value) and a equality check. The parenthesis are in the incorrect place,
currently the code clobbers the function return value because of this.

We can fix this by simplifying the `if` statement.

`if (foo != 0)`

can be more succinctly expressed as

`if (foo)`

Remove the equality check, add parenthesis to correct the statement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6685 
Close #6719
2017-10-05 19:33:44 -07:00
Isaac Huang
eea2e24132 Use linear abd in vdev_copy_uberblocks()
The vdev_copy_uberblocks() function should use abd_alloc_linear() to
allocate ub_abd, because abd_to_buf(ub_abd)) is used later.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #6718 
Closes #6713
2017-10-05 19:30:02 -07:00
Brian Behlendorf
c11f1004d1 Remove dead code from AVL tree
The avl_update_* functions are never used by ZFS and are therefore
being removed.  They're barely even used in Illumos.  Additionally,
simplify avl_add() by using a VERIFY which produces exactly the same
behavior under Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6716
2017-10-05 19:28:00 -07:00
Ned Bass
39f56627ae receive_freeobjects() skips freeing some objects
When receiving a FREEOBJECTS record, receive_freeobjects()
incorrectly skips a freed object in some cases. Specifically, this
happens when the first object in the range to be freed doesn't exist,
but the second object does. This leaves an object allocated on disk
on the receiving side which is unallocated on the sending side, which
may cause receiving subsequent incremental streams to fail.

The bug was caused by an incorrect increment of the object index
variable when current object being freed doesn't exist.  The
increment is incorrect because incrementing the object index is
handled by a call to dmu_object_next() in the increment portion of
the for loop statement.

Add test case that exposes this bug.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6694 
Closes #6695
2017-10-02 15:36:04 -07:00
Alek P
01ff0d7540 Update the default for zfs_txg_history
It's often useful to have access to txg history for debugging
purposes. This patch changes the default from 0 to 100 TXGs
worth of history preserved.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6691
2017-09-29 15:58:52 -07:00
chrisrd
e71cade67d Scale the dbuf cache with arc_c
Commit d3c2ae1 introduced a dbuf cache with a default size of the
minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted
using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache
is counted as metadata for the purposes of ARC size calculations.

On a 1GB box the ARC maximum size defaults to c_max 493M which gives a
dbuf cache default minimum size of 15.4M, and the ARC metadata defaults
to minimum 16M. I.e. the dbuf cache is an significant proportion of the
minimum metadata size. With other overheads involved this actually means
the ARC metadata doesn't get down to the minimum.

This patch dynamically scales the dbuf cache to the target ARC size
instead of statically scaling it to the maximum ARC size. (The scale is
still set by dbuf_cache_max_shift and the maximum size is still fixed by
dbuf_cache_max_bytes.) Using the target ARC size rather than the current
ARC size is done to help the ARC reach the target rather than simply
focusing on the current size.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #6506 
Closes #6561
2017-09-29 15:49:19 -07:00
DeHackEd
7e98073379 Fix printk() calls missing log level
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6672
2017-09-25 10:38:27 -07:00
Richard Elling
e8474f9ad3 Pool io stat shows wlentime instead of rlentime
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #652 
Closes #651
2017-09-25 10:02:24 -07:00
Olaf Faaland
d410c6d9fd Reimplement vdev_random_leaf and rename it
Rename it as mmp_random_leaf() since it is defined in mmp.c.

The earlier implementation could end up spinning forever if a pool had a
vdev marked writeable, none of whose children were writeable.  It also
did not guarantee that if a writeable leaf vdev existed, it would be
found.

Reimplement to recursively walk the device tree to select the leaf.  It
searches the entire tree, so that a return value of (NULL) indicates
there were no usable leaves in the pool; all were either not writeable
or had pending mmp writes.

It still chooses the starting child randomly at each level of the tree,
so if the pool's devices are healthy, the mmp writes go to random leaves
with an even distribution.  This was verified by testing using
zfs_multihost_history enabled.

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6631 
Closes #6665
2017-09-22 14:29:26 -07:00
Brian Behlendorf
4ce3c45a5e Increase default arc_c_min
Increase the default arc_c_min value to which whichever is larger,
either 32M or 1/32 of total system memory.  This is advantageous for
systems with more than 1G of memory where performance issues may
occur when the ARC is allowed to collapse below a minimum size.
At the same time we want to use the bare minimum value which is
still functional so the filesystem can be used in very low memory
environments.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6659
2017-09-20 09:36:17 -07:00
Brian Behlendorf
848259c10f Export symbol dmu_tx_mark_netfree()
This symbol is needed by Lustre for the same reason it was needed
by the ZPL.  It should have been exported when the original patch
was merged.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6660
2017-09-20 09:30:24 -07:00
Feng Sun
18a2485fc8 misc: fix meaningless values
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #6658
2017-09-19 12:19:08 -07:00
Giuseppe Di Natale
34d00e7aba Correct cppcheck errors
ZFS buildbot STYLE builder was moved to Ubuntu 17.04
which has a newer version of cppcheck. Handle the
new cppcheck errors.

uu_* functions removed in this commit were unused
and effectively dead code. They are now retired.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6653
2017-09-19 12:17:29 -07:00
Giuseppe Di Natale
787acae0b5 Linux 3.14 compat: IO acct, global_page_state, etc
generic_start_io_acct/generic_end_io_acct in the master
branch of the linux kernel requires that the request_queue
be provided.

Move the logic from freemem in the spl to arc_free_memory
in arc.c. Do this so we can take advantage of global_page_state
interface checks in zfs.

Upstream kernel replaced struct block_device with
struct gendisk in struct bio. Determine if the
function bio_set_dev exists during configure
and have zfs use that if it exists.

bio_set_dev https://github.com/torvalds/linux/commit/74d4699
global_node_page_state https://github.com/torvalds/linux/commit/75ef718
io acct https://github.com/torvalds/linux/commit/d62e26b

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6635
2017-09-16 11:00:19 -07:00
Gaurav Kumar
0107f69898 Modifying XATTRs doesnt change the ctime
Changing any metadata, should modify the ctime.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #3644 
Closes #6586
2017-09-13 12:20:07 -07:00
Arkadiusz Bubała
d9549cba96 Fix false config_cache_write events
On pool import when the old cache file is removed
the ereport.fs.zfs.config_cache_write event is generated.
Because zpool export always removes cache file it happens
every export - import sequence.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6617
2017-09-11 10:25:01 -07:00
Brian Behlendorf
5c214ae318 Fix volume WR_INDIRECT log replay
The portion of the zvol_replay_write() handler responsible for
replaying indirect log records for some reason never existed.
As a result indirect log records were not being correctly replayed.

This went largely unnoticed since the majority of zvol log records
were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578.

This patch updates zvol_replay_write() to correctly handle these
log records and adds a new test case which verifies volume replay
to prevent any regression.  The existing test case which verified
replay on filesystem was renamed slog_replay_fs.ksh for clarity.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6603 
Closes #6615
2017-09-08 15:07:00 -07:00
Brian Behlendorf
e0dd0a32a8 Revert "Handle new dnode size in incremental..."
This reverts commit 65dcb0f67a until
a comprehensive fix is finalized.  The stricter interior dnode
detection in 4c5b89f59e and the new
test case added by this patch revealed a issue with resizing
dnodes when receiving an incremental backup stream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6576
2017-09-07 10:00:54 -07:00
Olaf Faaland
4c5b89f59e Improved dnode allocation and dmu_hold_impl()
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396 
Closes #6522 
Closes #6414 
Closes #6564
2017-09-05 16:15:04 -07:00
Ned Bass
65dcb0f67a Handle new dnode size in incremental backup stream
When receiving an incremental backup stream, call
dmu_object_reclaim_dnsize() if an object's dnode size differs between
the incremental source and target. Otherwise it may appear that a
dnode which has shrunk is still occupying slots which are in fact
free. This will cause a failure to receive new objects that should
occupy the now-free slots.

Add a test case to verify that an incremental stream containing
objects with changed dnode sizes can be received without error. This
test case fails without this change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6366 
Closes #6576
2017-09-05 16:09:15 -07:00
Brian Behlendorf
e771de534f Trim new line from zfs_vdev_scheduler
Add a helper function to trim the tailing new line.  While we're
here use this new hook to immediately apply the new scheduler.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3356 
Closes #6573
2017-09-05 13:41:32 -07:00
LOLi
cf7684bc8d Retire send space estimation via ZFS_IOC_SEND
Add a small wrapper around libzfs_core`lzc_send_space() to libzfs so
that every legacy ZFS_IOC_SEND consumer, along with their userland
counterpart estimate_ioctl(), can leverage ZFS_IOC_SEND_SPACE to
request send space estimation.

The legacy functionality in zfs_ioc_send() is left untouched for
compatibility purposes.

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6029
2017-08-31 09:00:35 -07:00
Richard Lowe
1afc54f7f4 OpenZFS 2976 - remove useless offsetof() macros
Authored by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2976
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5c5f137
Closes #6582
2017-08-30 15:53:38 -07:00
Gvozden Neskovic
d22323e89f dmu_objset: release bonus buffer in failure path
Reported by kmemleak during testing of a new patch:

```
unreferenced object 0xffff9f1c12e38800 (size 1024):
  comm "z_upgrade", pid 17842, jiffies 4296870904 (age 8746.268s)
  backtrace:
    kmemleak_alloc+0x7a/0x100
    __kmalloc_node+0x26c/0x510
    range_tree_create+0x39/0xa0 [zfs]
    dmu_zfetch_init+0x73/0xe0 [zfs]
    dnode_create+0x12c/0x3b0 [zfs]
    dnode_hold_impl+0x1096/0x1130 [zfs]
    dnode_hold+0x23/0x30 [zfs]
    dmu_bonus_hold_impl+0x6b/0x370 [zfs]
    dmu_bonus_hold+0x1e/0x30 [zfs]
    dmu_objset_space_upgrade+0x114/0x310 [zfs]
    dmu_objset_userobjspace_upgrade_cb+0xd8/0x150 [zfs]
    dmu_objset_upgrade_task_cb+0x136/0x1e0 [zfs]    
    kthread+0x119/0x150
```

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6575
2017-08-30 12:09:18 -07:00
Eli Rosenthal
74ea6092d0 OpenZFS 7028 - avl_destroy_nodes supports emptying, not just destroying, an avl tree
Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7028
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/86f617e
Closes #6583
2017-08-30 12:08:38 -07:00
Steve Dougherty
de327eccbb OpenZFS 6447 - handful of nvpair cleanups
Authored by: Steve Dougherty <sdougherty@barracuda.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6447
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/759e89b
Closes #6581
2017-08-30 12:04:27 -07:00
Andriy Gapon
ecaebdbcf6 OpenZFS 5778 - nvpair_type_is_array() does not recognize DATA_TYPE_INT8_ARRAY
Authored by: Andriy Gapon <avg@icyb.net.ua>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5778
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bf4d553
Closes #6580
2017-08-30 12:00:58 -07:00
Matthew Ahrens
24ded86e8d OpenZFS 7261 - nvlist code should enforce name length limit
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7261
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/48dd5e6
Closes #6579
2017-08-30 11:58:00 -07:00
Matthew Ahrens
006309e8d7 OpenZFS 8375 - Kernel memory leak in nvpair code
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8375
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843c211
Closes #6578
2017-08-30 11:50:12 -07:00
Matthew Ahrens
1e0457e7f5 Enhance comments for large dnode project
Fix a few nits in the comments from large dnodes. Also import
some of the commit message as a comment in the code, making
it more accessible.

Reviewed-by: @rottegift 
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matt Ahrens <mahrens@delphix.com>
Closes #6551
2017-08-29 09:00:28 -07:00
dbavatar
2209e40981 Linux 4.8+ compatibility fix for vm stats
vm_node_stat must be used instead of vm_zone_stat. Unfortunately the
old code still compiles potentially leading to silent failure of
arc_evictable_memory()

AKAMAI: CR 3816601: Regression in zfs dropcache test

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6528
2017-08-24 10:48:23 -07:00
chrisrd
2fb1a234ab dbuf_cons: deduplicate multilist_link_init()
Remove harmless duplicate multilist_link_init() introduced by
commit d3c2ae1.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6552
2017-08-24 10:31:59 -07:00
Alek P
e4b6b2db12 OpenZFS 8414 - Implemented zpool scrub pause/resume
Authored by: Alek Pinchuk <apinchuk@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://www.illumos.org/issues/8414
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c29616076
Closes #6538
2017-08-24 10:27:20 -07:00
Tom Caputi
9b8407638d Send / Recv Fixes following b52563
This patch fixes several issues discovered after
the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt
  to receive embedded data records.

* Fixed a bug where dirty records created by the recv
  code wasn't properly setting the dr_raw flag.

* Fixed a typo where a dmu_tx_commit() was changed to
  dmu_tx_abort()

* Fixed a few error handling bugs unrelated to the
  encryption patch in dmu_recv_stream()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6512 
Closes #6524 
Closes #6545
2017-08-23 16:54:24 -07:00
Chunwei Chen
05f85a6a64 Fix zfs_ioc_pool_sync should not use fnvlist
Use fnvlist on user input would allow user to easily panic zfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6529
2017-08-21 13:11:11 -07:00
Gvozden Neskovic
551905dd47 vdev_mirror: kstat observables for preferred vdev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:54 -07:00
Gvozden Neskovic
d6c6590c5d vdev_mirror: load balancing fixes
vdev_queue:
- Track the last position of each vdev, including the io size,
  in order to detect linear access of the following zio.
- Remove duplicate `vq_lastoffset`

vdev_mirror:
- Correctly calculate the zio offset (signedness issue)
- Deprecate `vdev_queue_register_lastoffset()`
- Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:16 -07:00
LOLi
f763c3d1df Fix range locking in ZIL commit codepath
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().

Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.

Fix this by taking a range lock on the whole block in zvol_get_data().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6238 
Closes #6315 
Closes #6356 
Closes #6477
2017-08-21 08:59:48 -07:00
LOLi
08de8c16f5 Fix remounting snapshots read-write
It's not enough to preserve/restore MS_RDONLY on the superblock flags
to avoid remounting a snapshot read-write: be explicit about our
intentions to the VFS layer so the readonly bit is updated correctly
in do_remount_sb().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6510 
Closes #6515
2017-08-17 14:28:17 -07:00
Brian Behlendorf
c8f9061fc7 Retire legacy test infrastructure
* Removed zpios kmod, utility, headers and man page.

* Removed unused scripts zpios-profile/*, zpios-test/*,
  zpool-config/*, smb.sh, zpios-sanity.sh, zpios-survey.sh,
  zpios.sh, and zpool-create.sh.

* Removed zfs-script-config.sh.in.  When building 'make' generates
  a common.sh with in-tree path information from the common.sh.in
  template.  This file and sourced by the test scripts and used
  for in-tree testing, it is not included in the packages.  When
  building packages 'make install' uses the same template to
  create a new common.sh which is appropriate for the packaging.

* Removed unused functions/variables from scripts/common.sh.in.
  Only minimal path information and configuration environment
  variables remain.

* Removed unused scripts from scripts/ directory.

* Remaining shell scripts in the scripts directory updated to
  cleanly pass shellcheck and added to checked scripts.

* Renamed tests/test-runner/cmd/ to tests/test-runner/bin/ to
  match install location name.

* Removed last traces of the --enable-debug-dmu-tx configure
  options which was retired some time ago.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6509
2017-08-15 17:26:38 -07:00
Don Brady
d977122da9 Add corruption failure option to zinject(8)
Added a 'corrupt' error option that will flip a bit in the data
after a read operation.  This is useful for generating checksum
errors at the device layer (in a mirror config for example). It
is also used to validate the diagnosis of checksum errors from
the zfs diagnosis engine.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6345
2017-08-14 15:17:15 -07:00
Tom Caputi
b525630342 Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494 
Closes #5769
2017-08-14 10:36:48 -07:00
Chunwei Chen
376994828f Fix NULL pointer when O_SYNC read in snapshot
When doing read on a file open with O_SYNC, it will trigger zil_commit.
However for snapshot, there's no zil, so we shouldn't be doing that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6478 
Closes #6494
2017-08-11 08:57:54 -07:00
gaurkuma
761b8ec6bf Allow longer SPA names in stats
The pool name can be 256 chars long. Today, in /proc/spl/kstat/zfs/
the name is limited to < 32 characters. This change is to allows
bigger pool names.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #6481
2017-08-11 08:56:24 -07:00
gaurkuma
9df9692637 Allow longer SPA names in stats
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #641
2017-08-11 08:53:35 -07:00
Brian Behlendorf
c25b8f99f8 Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks

* Update the zk_thread_create() function to use the same trick
  as Illumos.  Specifically, cast the new pthread_t to a void
  pointer and return that as the kthread_t *.  This avoids the
  issues associated with managing a wrapper structure and is
  safe as long as the callers never attempt to dereference it.

* Update all function prototypes passed to pthread_create() to
  match the expected prototype.  We were getting away this with
  before since the function were explicitly cast.

* Replaced direct zk_thread_create() calls with thread_create()
  for code consistency.  All consumers of libzpool now use the
  proper wrappers.

* The mutex_held() calls were converted to MUTEX_HELD().

* Removed all mutex_owner() calls and retired the interface.
  Instead use MUTEX_HELD() which provides the same information
  and allows the implementation details to be hidden.  In this
  case the use of the pthread_equals() function.

* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
  any non essential fields removed.  In the case of kthread_t
  and kcondvar_t they could be directly typedef'd to pthread_t
  and pthread_cond_t respectively.

* Removed all extra ASSERTS from the thread, mutex, rwlock, and
  cv wrapper functions.  In practice, pthreads already provides
  the vast majority of checks as long as we check the return
  code.  Removing this code from our wrappers help readability.

* Added TS_JOINABLE state flag to pass to request a joinable rather
  than detached thread.  This isn't a standard thread_create() state
  but it's the least invasive way to pass this information and is
  only used by ztest.

TEST_ZTEST_TIMEOUT=3600

Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547 
Closes #5503 
Closes #5523 
Closes #6377 
Closes #6495
2017-08-11 08:51:44 -07:00
sanjeevbagewadi
21df134f4c zio_dva_throttle_done() should allow zinjected ZIO
If fault injection is enabled, the ZIO_FLAG_IO_RETRY could be set by
zio_handle_device_injection() to generate the FMA events and update
stats. Hence, ignore the flag and process such zios.

A better fix would be to add another flag in the zio_t to indicate that
the zio is failed because of a zinject rule. However, considering the
fact that we do this in debug bits, we could do with the crude check
using the global flag zio_injection_enabled which is set to 1 when
zinject records are added.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #6383 
Closes #6384
2017-08-10 15:53:40 -07:00
Fabian-Gruenbichler
bbefaeba29 make module/spl/spl-kmem.c non-executable (again)
This was probably accidentally committed in

aeb9baa618
Fix: handle NULL case in spl_kmem_free_track()

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #644
2017-08-10 15:23:43 -07:00
Brian Behlendorf
46364cb2f3 Add libtpool (thread pools)
OpenZFS provides a library called tpool which implements thread
pools for user space applications.  Porting this library means
the zpool utility no longer needs to borrow the kernel mutex and
taskq interfaces from libzpool.  This code was updated to use
the tpool library which behaves in a very similar fashion.

Porting libtpool was relatively straight forward and minimal
modifications were needed.  The core changes were:

* Fully convert the library to use pthreads.
* Updated signal handling.
* lmalloc/lfree converted to calloc/free
* Implemented portable pthread_attr_clone() function.

Finally, update the build system such that libzpool.so is no
longer linked in to zfs(8), zpool(8), etc.  All that is required
is libzfs to which the zcommon soures were added (which is the way
it always should have been).  Removing the libzpool dependency
resulted in several build issues which needed to be resolved.

* Moved zfeature support to module/zcommon/zfeature_common.c
* Moved ratelimiting to to module/zfs/zfs_ratelimit.c
* Moved get_system_hostid() to lib/libspl/gethostid.c
* Removed use of cmn_err() in zcommon source
* Removed dprintf_setup() call from zpool_main.c and zfs_main.c
* Removed highbit() and lowbit()
* Removed unnecessary library dependencies from Makefiles
* Removed fletcher-4 kstat in user space
* Added sha2 support explicitly to libzfs
* Added highbit64() and lowbit64() to zpool_util.c

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6442
2017-08-09 15:31:08 -07:00
Boris Protopopov
5146d802b4 zv_suspend_lock in zvol_open()/zvol_release()
Acquire zv_suspend_lock on first open and last close only.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #6342
2017-08-09 11:10:47 -07:00
Ned Bass
6a8ee4f71d Add debug log entries for failed receive records
Log contents of a receive record if an error occurs while writing
it out to the pool. This may help determine the cause when backup
streams are rejected as invalid.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6465
2017-08-08 08:41:31 -07:00
Brian Behlendorf
9631681b75 Fix dnode allocation race
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occurs for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only latter when
  dnode_allocate() is called will it transition to the new
  type.  This means there's a window when allocating where
  it can mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414 
Closes #6439
2017-08-08 08:38:53 -07:00
Boris Protopopov
9243b0fb47 Add assert under lock to detect cases of dispach of a preallocated
taskq work item to more than one queue concurrently. Also, please
see discussion in zfsonlinux/zfs#3840.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #609
2017-08-08 08:31:52 -07:00
Chunwei Chen
cce83ba0ec Fix use-after-free in taskq_seq_show_impl
taskq_seq_show_impl walks the tq_active_list to show the tqent_func and
tqent_arg. However for taskq_dispatch_ent, it's very likely that the
task entry will be freed during the function call, and causes a
use-after-free bug.

To fix this, we duplicate the task entry to an on-stack struct, and
assign it instead to tqt_task. This way, the tq_lock alone will
guarantee its safety.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #638 
Closes #640
2017-08-04 09:57:58 -07:00
Chunwei Chen
6ecfd2b553 Add __divmoddi4 and __udivmoddi4 for 32-bit arch
gcc-7 seems to use __udivmoddi4 for 64-bit division on 32-bit arch. This
patch implement them so we don't get undefined reference error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes zfsonlinux/zfs#6417 
Closes #636
2017-08-03 10:41:42 -07:00
Ned Bass
ecb2b7dc7f Use SET_ERROR for constant non-zero return codes
Update many return and assignment statements to follow the convention
of using the SET_ERROR macro when returning a hard-coded non-zero
value from a function. This aids debugging by recording the error
codes in the debug log.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6441
2017-08-02 21:16:12 -07:00
Tony Hutter
6710381680 Only record zio->io_delay on reads and writes
While investigating https://github.com/zfsonlinux/zfs/issues/6425 I
noticed that ioctl ZIOs were not setting zio->io_delay correctly.  They
would set the start time in zio_vdev_io_start(), but never set the end
time in zio_vdev_io_done(), since ioctls skip it and go straight to
zio_done().  This was causing spurious "delayed IO" events to appear,
which would eventually get rate-limited and displayed as
"Missed events" messages in zed.

To get around the problem, this patch only sets zio->io_delay for read
and write ZIOs, since that's all we care about anyway.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6425 
Closes #6440
2017-08-02 09:08:38 -07:00
LOLi
c7a7601c08 Fix volmode=none property behavior at import time
At import time spa_import() calls zvol_create_minors() directly: with
the current implementation we have no way to avoid device node
creation when volmode=none.

Fix this by enforcing volmode=none directly in zvol_alloc().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6426
2017-07-31 11:07:05 -07:00
LOLi
650258d7c7 zfs promote|rename .../%recv should be an error
If we are in the middle of an incremental 'zfs receive', the child
.../%recv will exist. If we run 'zfs promote' .../%recv, it will "work",
but then zfs gets confused about the status of the new dataset.
Attempting to do this promote should be an error.

Similarly renaming .../%recv datasets should not be allowed.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #4843 
Closes #6339
2017-07-28 14:12:34 -07:00
Andriy Gapon
f06f53fa3f OpenZFS 7915 - checks in l2arc_evict could use some cleaning up
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7915
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/836a00c
Closes #6375
2017-07-28 14:09:49 -07:00
Andriy Gapon
e98b611725 OpenZFS 8373 - TXG_WAIT in ZIL commit path
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8373
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7f04961
Closes #6403
2017-07-28 14:08:20 -07:00
Giuseppe Di Natale
bff245dd34 OpenZFS 8508 - Mounting a zpool on 32-bit platforms panics
Authored by: Justin Hibbits <chmeeedalf@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8508
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/15fc257
Closes #6404
2017-07-26 09:44:21 -07:00
Ned Bass
8740cf4a2f Add line info and SET_ERROR() to ZFS debug log
Redefine the SET_ERROR macro in terms of __dprintf() so the error
return codes get logged as both tracepoint events (if tracepoints are
enabled) and as ZFS debug log entries.  This also allows us to use
the same definition of SET_ERROR() in kernel and user space.

Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise
or'd into zfs_flags. Setting this flag enables both dprintf() and
SET_ERROR() messages in the debug log. That is, setting
ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are
equivalent (this was done for sake of simplicity). Leaving
ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which
helps avoid cluttering up the logs.

To enable SET_ERROR() logging, run:

  echo 1 >   /sys/module/zfs/parameters/zfs_dbgmsg_enable
  echo 512 > /sys/module/zfs/parameters/zfs_flags

Remove the zfs_set_error_class tracepoints event class since
SET_ERROR() now uses __dprintf(). This sacrifices a bit of
granularity when selecting individual tracepoint events to enable but
it makes the code simpler.

Include file, function, and line number information in debug log
entries.  The information is now added to the message buffer in
__dprintf() and as a result the zfs_dprintf_class tracepoints event
class was changed from a 4 parameter interface to a single parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6400
2017-07-25 23:09:48 -07:00
Oleg Drokin
410f7ab594 Module parameter to enable spl_panic() to panic the kernel
In unattended operations it's often more useful to have node
panic and reboot when it encounters problems as opposed to
sit there indefinitely waiting for somebody to discover it.

This implements an spl_panic_crash module parameter, set it
to nonzero to cause spl_panic() to call panic().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #634
2017-07-25 23:03:12 -07:00
Ned Bass
73aac4aa41 Some additional send stream validity checking
Check in the DMU whether an object record in a send stream being
received contains an unsupported dnode slot count, and return an
error if it does. Failure to catch an unsupported dnode slot count
would result in a panic when the SPA attempts to increment the
reference count for the large_dnode feature and the pool has the
feature disabled. This is not normally an issue for a well-formed
send stream which would have the DMU_BACKUP_FEATURE_LARGE_DNODE flag
set if it contains large dnodes, so it will be rejected as
unsupported if the required feature is disabled. This change adds a
missing object record field validation.

Add missing stream feature flag checks in
dmu_recv_resume_begin_check().

Consolidate repetitive comment blocks in dmu_recv_begin_check().

Update zstreamdump to print the dnode slot count (dn_slots) for an
object record when running in verbose mode.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6396
2017-07-25 18:52:40 -07:00
Brian Behlendorf
3f759c0c73 Fix 'zpool clear' on suspended pools
'zpool clear' should be able to resume I/O on suspended, but otherwise
healthy, pools.

4a283c7 accidentally introduced a new code path where we call
txg_wait_synced() on the suspended pool before we had the chance to
resume I/O via zio_resume(): this results in the 'zpool clear'
command hanging indefinitely, waiting for a TXG that cannot be synced.

Fix this by avoiding the call to txg_wait_synced().

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6399
2017-07-25 12:20:52 -07:00
Olaf Faaland
e889f0f520 Report MMP_STATE_NO_HOSTID immediately
There is no need to perform the activity check before detecting that the
user must set the system hostid, because the pool's multihost property
is on, but spa_get_hostid() returned 0.  The initial call to
vdev_uberblock_load() provided the information required.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6388
2017-07-25 13:22:28 -04:00
Olaf Faaland
0582e40322 Add callback for zfs_multihost_interval
Add a callback to wake all running mmp threads when
zfs_multihost_interval is changed.

This is necessary when the interval is changed from a very large value
to a significantly lower one, while pools are imported that have the
multihost property enabled.

Without this commit, the mmp thread does not wake up and detect the new
interval until after it has waited the old multihost interval time.  A
user monitoring mmp writes via the provided kstat would be led to
believe that the changed setting did not work.

Added a test in the ZTS under mmp to verify the new functionality is
working.

Added a test to ztest which starts and stops mmp threads, and calls into
the code to signal sleeping mmp threads, to test for deadlocks or
similar locking issues.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6387
2017-07-25 13:22:20 -04:00
Olaf Faaland
ffb195c256 Release SCL_STATE in map_write_done()
The config lock must be held for the duration of the MMP write.
Since the I/Os are executed via map_nowait(), the done function
is the only place where we know the write has completed.

Since SCL_STATE is taken as reader, overlapping I/Os do not
create a deadlock.  The refcount is simply increased when new
I/Os are queued and decreased when I/Os complete.

Test case added which exercises the probe IO call path to
verify the fix and prevent a regression.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6394
2017-07-25 12:25:05 -04:00
Olaf Faaland
f43615d0cc Revert Fix vdev_probe() call wrt SCL_STATE_ALL
This reverts commit cc9c6bc, which has been causing intermittent
test failures on buildbot.  A correct fix for this locking issue
has been applied in a separate patch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-07-25 12:24:42 -04:00
LOLi
871e07321c Fix buffer overflow in dsl_dataset_name()
If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11)
we need to account for additional space needed by the origin dataset
which will also be snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6374
2017-07-24 12:56:49 -07:00
Olaf Faaland
b6e5c40382 Use correct macro for hz in mmp.c
Commit 379ca9c Multi-modifier protection (MMP) used HZ to convert
nanoseconds to ticks for use with cv_timedwait() and ddi_get_lbolt().
The correct macro is hz, which is defined within the SPL for kernel
space, and within zfs_context.h for user space.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6357 
Closes #6360
2017-07-24 11:22:10 -07:00
Giuseppe Di Natale
802ae562ed Fix coverity defects: CID 165755
CID 165755: Division or modulo by zero (DIVIDE_BY_ZERO)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6352
2017-07-24 11:16:58 -07:00
LOLi
cd47801828 Avoid WARN() from procfs on kstat collision
When we load a ZFS pool having spa_name equals to some existing kstat
we would have to create a duplicate entry, which procfs doesn't like.

For instance a ZFS pool named "zil" would have its kstat "txgs"
(module "zfs/zil") intalled under "/proc/spl/kstat/zfs/zil":
unfortunately we already have a kstat named "zil" (module "zfs")
installed in the same procfs location.

Avoid this issue by skipping the duplicate entry creation in procfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #628
2017-07-24 10:52:53 -07:00
Brian Behlendorf
36ba27e9e0 Linux 4.13 compat: bio->bi_status and blk_status_t
Commit torvalds/linux@4e4cbee9.  The bio->bi_error field was
replaced with bio->bi_status which is an enum that describes
all possible error types.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6351
2017-07-23 19:37:12 -07:00
Brian Behlendorf
944117514d Linux 4.13 compat: wait queues
Commit torvalds/linux@ac6424b9
- Renamed struct wait_queue -> struct wait_queue_entry.

Commit torvalds/linux@2055da97
- Renamed wait_queue_head::task_list -> wait_queue_head::head
- Renamed wait_queue_entry::task_list -> wait_queue_entry::entry

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #629
2017-07-23 19:32:14 -07:00
Brian Behlendorf
cc9c6bcb73 Fix vdev_probe() call outside SCL_STATE_ALL lock
When an IO fails then zio_vdev_io_done() can call vdev_probe()
to determine the health of the vdev.  This is safe as long as
the original zio was submitted with zio_wait() and holds the
SCL_STATE_ALL lock over the operation.

If zio_no_wait() was used then the done callback will submit
the probe IO outside the SCL_STATE_ALL lock and hit this
ASSERT in zio_create()

  ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));

Resolve the issue by only allowing vdev_probe() to be called
when there's a waiter indicating the caller is using zio_wait().
This assumes that caller is still holding SCL_STATE_ALL.

This issue isn't MMP specific but was surfaced when testing.
Without this patch it can be reproduced by running:

  zpool set multihost on <pool>
  zinject -d <vdev> -e io -T write -f 50 <pool> -L uber

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #745
Closes #6279
2017-07-13 13:54:10 -04:00
Olaf Faaland
379ca9cf2b Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP.  When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp.  Property defaults to off.

During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock.  Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".

Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval.  The period is specified in milliseconds.  The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.

Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path.  Abbreviated
output below.

$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg   timestamp  mmp_delay   vdev_guid   vdev_label vdev_path
20468    261337  250274925   68396651780       3    /dev/sda
20468    261339  252023374   6267402363293     1    /dev/sdc
20468    261340  252000858   6698080955233     1    /dev/sdx
20468    261341  251980635   783892869810      2    /dev/sdy
20468    261342  253385953   8923255792467     3    /dev/sdd
20468    261344  253336622   042125143176      0    /dev/sdab
20468    261345  253310522   1200778101278     2    /dev/sde
20468    261346  253286429   0950576198362     2    /dev/sdt
20468    261347  253261545   96209817917       3    /dev/sds
20468    261349  253238188   8555725937673     3    /dev/sdb

Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.

When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test.  For example, the
pool is exported to run zdb and then imported again.  Add a new ztest
function, "-M", to alter ztest behavior to prevent this.

Add new tests to verify the new functionality.  Tests provided by
Giuseppe Di Natale.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-13 13:54:00 -04:00
Brian Behlendorf
c93d9dff36 Don't cache the system hostid
Historically the SPL cached the system hostid the first time it
was accessed.  This was done to speed up subsequent accesses.
But in practice the system host id is rarely accessed and its
inconvenient that it doesn't promptly detect /etc/hostid
configuration changes.  Therefore, zone_get_hostid() has been
updated to always refresh the system hostid reported.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #626
2017-07-13 13:22:28 -04:00
Dave Eddy
12fa0466df OpenZFS 6939 - add sysevents to zfs core for commands
Authored by: Dave Eddy <dave@daveeddy.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6939
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b
Closes #6328
2017-07-12 21:28:13 -07:00
LOLi
cf8738d853 Add port of FreeBSD 'volmode' property
The volmode property may be set to control the visibility of ZVOL
block devices.

This allow switching ZVOL between three modes:
   full - existing fully functional behaviour (default)
   dev  - hide partitions on ZVOL block devices
   none - not exposing volumes outside ZFS

Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.

This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.

Original-patch-by: mav <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796 
Closes #3438 
Closes #6233
2017-07-12 13:05:37 -07:00
Yuri Pankov
e19572e4cc OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* All hunks unrelated to ZFS were dropped.

OpenZFS-issue: https://www.illumos.org/issues/5428
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130
Closes #6326
2017-07-08 20:35:35 -07:00
Matthew Ahrens
2ade4a99f0 OpenZFS 8126 - ztest assertion failed in dbuf_dirty due to dn_nlevels changing
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8126
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0ef125d
Closes #6314
2017-07-07 14:58:33 -07:00
Matthew Ahrens
a896468c78 OpenZFS 8067 - zdb should be able to dump literal embedded block pointer
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8067
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085
Closes #6319
2017-07-07 11:28:01 -07:00
LOLi
92e43c1718 Fix 'zpool clear' on readonly pools
Illumos 4080 inadvertently allows 'zpool clear' on readonly pools: fix
this by reintroducing a check (POOL_CHECK_READONLY) in zfs_ioc_clear
registration code.

Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6306
2017-07-07 10:39:53 -07:00
Alek P
0ea05c64f8 Implemented zpool scrub pause/resume
Currently, there is no way to pause a scrub. Pausing may
be useful when the pool is busy with other I/O to preserve
bandwidth.

This patch adds the ability to pause and resume scrubbing.
This is achieved by maintaining a persistent on-disk scrub state.
While the state is 'paused' we do not scrub any more blocks.
We do however perform regular scan housekeeping such as
freeing async destroyed and deadlist blocks while paused.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6167
2017-07-06 22:16:13 -07:00
Arkadiusz Bubała
94b25662c5 Reschedule processes on -ERESTARTSYS
On the single core machine the system may hang when the
spa_namespare_lock acquisition fails in the zvol_first_open
function. It returns -ERESTARTSYS error what causes the
endless loop in __blkdev_get function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6283 
Closes #6312
2017-07-06 08:38:24 -07:00
alaviss
688c94c5c0 Clang fixes
Clang doesn't support `/` as comment in assembly, this patch replaces
them with `#`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Leorize <alaviss@users.noreply.github.com>
Closes #6311
2017-07-05 10:38:20 -07:00
Matthew Ahrens
02dc43bc46 OpenZFS 8378 - crash due to bp in-memory modification of nopwrite block
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

The problem is that zfs_get_data() supplies a stale zgd_bp to
dmu_sync(), which we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it
copies db_blkptr to zgd_bp, dbuf_write_ready() could change
db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale BP and that the dbuf it not dirty,
so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after
acquiring the db_mtx. We could still see a stale db_blkptr,
but if it is stale then the dirty record will still exist and
thus we won't attempt to nopwrite.

OpenZFS-issue: https://www.illumos.org/issues/8378
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742
Closes #6293
2017-07-04 15:41:24 -07:00
Andriy Gapon
8ca78ab002 OpenZFS 7600 - zfs rollback should pass target snapshot to kernel
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

The existing kernel-side code only provides a method to rollback to a
latest snapshot, whatever it happens to be at the time when the rollback
is actually done.  That could be unsafe or confusing in environments
where concurrent DSL changes are possible as the resulting state could
correspond to a newer or older snapshot than the originally requested
one.
This change allows to amend that method such that the rollback is
performed only when the latest snapshot has a specific name.  That is,
if a new snapshot is concurrently created or the target snapshot is
destroyed, then no rollback is done and EXDEV error is returned.
New libzfs_core function lzc_rollback_to() is provided for the new
functionality.  libzfs is changed to use lzc_rollback_to() to implement
zfs rollback command.
Perhaps we should return different errors to distinguish the case where
the desired snapshot exists but it's not the latest snapshot and the
case where the desired snapshot does not exist.

OpenZFS-issue: https://www.illumos.org/issues/7600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb
Closes #6292
2017-07-04 15:29:52 -07:00
Andriy Gapon
018503911c OpenZFS 7910 - l2arc_write_buffers() may write beyond target_sz
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7910
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb6af4b
Closes #6291
2017-07-04 15:28:58 -07:00
Matthew Ahrens
c6f6767eea OpenZFS 8377 - Panic in bookmark deletion
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

The problem is that when dsl_bookmark_destroy_check() is
executed from open context (the pre-check), it fills in
dbda_success based on the existence of the bookmark. But
the bookmark (or containing filesystem as in this case)
can be destroyed before we get to syncing context. When
we re-run dsl_bookmark_destroy_check() in syncing context,
it will not add the deleted bookmark to dbda_success,
intending for dsl_bookmark_destroy_sync() to not process
it. But because the bookmark is still in dbda_success from
the open-context call, we do try to destroy it.
The fix is that dsl_bookmark_destroy_check() should not
modify dbda_success when called from open context.

OpenZFS-issue: https://www.illumos.org/issues/8377
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0b6fe3
Closes #6286
2017-06-30 11:11:01 -07:00
Matthew Ahrens
817b1b6e7b Clean up large dnode code
Resolves issues discovered when porting to OpenZFS.

* Lint warnings.
* Made dnode_move_impl() large dnode aware.  This
  functionality is currently unused on Linux.

Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6262
2017-06-29 10:18:03 -07:00
chrisrd
b8a97fb101 Set arc_meta_limit, arc_dnode_limit on change
Make zfs_arc_meta_limit_percent and zfs_arc_dnode_limit_percent behave
as you would expect from zfs-module-parameters.5.

- recalculate arc_meta_limit if zfs_arc_meta_limit_percent changes
- recalculate arc_dnode_limit if zfs_arc_dnode_limit_percent changes
- correctly set arc_meta_limit and arc_dnode_limit if zfs_arc_max or
  zfs_arc_meta_min changes

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6269
2017-06-29 09:57:27 -07:00
Tony Hutter
682ce104cd GCC 7.1 fixes
GCC 7.1 with will warn when we're not checking the snprintf()
return code in cases where the buffer could be truncated. This
patch either checks the snprintf return code (where applicable),
or simply disables the warnings (ztest.c).

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6253
2017-06-28 10:05:16 -07:00
Brian Behlendorf
2d678f779a Cap maximum aggregate IO size
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit.  Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259 
Closes #6270
2017-06-27 10:09:16 -07:00
Boris Protopopov
58404a73db Refine use of zv_state_lock.
Use zv_state_lock to protect all members of zvol_state structure, add
relevant ASSERT()s. Take zv_suspend_lock before zv_state_lock, do not
hold zv_state_lock across suspend/resume.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #6226
2017-06-27 12:51:44 -04:00
Giuseppe Di Natale
82710e993a OpenZFS 5220 - L2ARC does not support devices that do not provide 512B access
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5220
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/403a8da
Closes #6260
2017-06-26 17:32:43 -07:00
Giuseppe Di Natale
d12f91fde3 OpenZFS 8264 - want support for promoting datasets in libzfs_core
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8264
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a
Closes #6254
2017-06-26 16:56:09 -07:00
Boris Protopopov
03928896e1 Call cv_signal() with mutex held
In bqueue_dequeue(), call cv_signal() with bq_lock held.
Re-enable rsend_009_pos to test the fix.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #5887
2017-06-26 14:36:49 -07:00
Morgan Jones
d9ad3fea3b Add kpreempt_disable/enable around CPU_SEQID uses
In zfs/dmu_object and icp/core/kcf_sched, the CPU_SEQID macro
should be surrounded by `kpreempt_disable` and `kpreempt_enable`
calls to avoid a Linux kernel BUG warning.  These code paths use
the cpuid to minimize lock contention and is is safe to reschedule
the process to a different processor at any time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Morgan Jones <me@numin.it>
Closes #6239
2017-06-19 09:43:16 -07:00
Don Brady
0241e491a0 Inject zinject(8) a percentage amount of dev errs
In the original form of device error injection, it was an all or nothing
situation.  To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6227
2017-06-16 17:21:11 -07:00
Boris Protopopov
ef4be34a64 Avoid 'queue not locked' warning at pool import.
Use queue_flag_set_unlocked() in zvol_alloc().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Issue #6226
2017-06-15 11:16:19 -07:00
LOLi
97f8d7961e Fix zvol_state_t->zv_open_count race
5559ba0 added zv_state_lock to protect zvol_state_t internal data:
this, however, doesn't guard zv->zv_open_count and
zv->zv_disk->private_data in zvol_remove_minors_impl().

Fix this by taking zv->zv_state_lock before we check its zv_open_count.

P1 (z_zvol)                       P2 (systemd-udevd)
---                               ---
zvol_remove_minors_impl()
: zv->zv_open_count==0
                                  zvol_open()
                                  ->mutex_enter(zv_state_lock)
                                  : zv->zv_open_count++
                                  ->mutex_exit(zv_state_lock)
->mutex_enter(zv->zv_state_lock)
->zvol_remove(zv)
->mutex_exit(zv->zv_state_lock)
: zv->zv_disk->private_data = NULL
->zvol_free()
-->ASSERT(zv->zv_open_count==0) *
                                  zvol_release()
                                  : zv = disk->private_data
                                  ->ASSERT(zv && zv->zv_open_count>0) *
---                               ---
* ASSERT() fails

Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6213
2017-06-15 11:08:45 -07:00
Richard Yao
8f7933fec9 Fix zvol_init error handling
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@prophetstor.com>
2017-06-13 15:29:21 -04:00
Richard Yao
5228cf0116 Make zvol operations use _by_dnode routines
This continues what was started in
0eef1bde31 by fully converting zvols
to avoid unnecessary dnode_hold() calls. This saves a small amount
of CPU time and slightly improves latencies of operations on zvols.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@prophetstor.com>
Closes #6058
2017-06-13 09:18:08 -07:00
DeHackEd
419c80e6dc Reduce stack usage of dsl_dir_tempreserve_impl
Buildbots and zfs-tests regularly see 7 kilobytes of stack
usage with this function. Convert self-calls to iterations

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6219
2017-06-12 11:41:03 -07:00
Paul Dagnelie
dd429b46b7 OpenZFS 8056 - zfs send size estimate is inaccurate for some zvols
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

The send size estimate for a zvol can be too low, if the size of the
record headers (dmu_replay_record_t's) is a significant portion of the
size. This is typically the case when the data is highly compressible,
especially with embedded blocks.

The problem is that dmu_adjust_send_estimate_for_indirects() assumes
that blocks are the size of the "recordsize" property (128KB). However,
for zvols, the blocks are the size of the "volblocksize" property (8KB).
Therefore, we estimate that there will be 16x less record headers than
there really will be.

The fix is to check the type of the object set (whether it is a zvol or
not) and pick the appropriate property. In addition, while we are at it,
we also add the size of the BEGIN and END records to the estimate.

OpenZFS-issue: https://www.illumos.org/issues/8056
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faf09cd
Closes #6205
2017-06-09 09:46:14 -07:00
Matthew Ahrens
38240ebd7a OpenZFS 8156 - dbuf_evict_notify() does not need dbuf_evict_lock
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should
do the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.  It isn't necessary to hold
the lock, because if we make the wrong choice occasionally, nothing bad
will happen. This commit results in a ~60% performance improvement for
ARC-cached sequential reads.

OpenZFS-issue: https://www.illumos.org/issues/8156
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f73e5d9
Closes #6204
2017-06-09 09:45:13 -07:00
Matthew Ahrens
dbeb879699 OpenZFS 8199 - multi-threaded dmu_object_alloc()
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock.  To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.

The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs.  When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128).  Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.

This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.

Work sponsored by Intel Corp.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
2017-06-09 09:43:26 -07:00
Giuseppe Di Natale
1b7c1e5ce9 OpenZFS 7578 - Fix/improve some aspects of ZIL writing
- After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

 - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG.  Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

 - Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size.  In some
cases efficiency stopped to almost as low as 50%.  In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas.  This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently.  Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

 - While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by:   iXsystems, Inc.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
2017-06-09 09:15:37 -07:00
Matthew Ahrens
82644107c4 OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffers
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.

We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.

OpenZFS-issue: https://www.illumos.org/issues/8155
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58
Closes #6200
2017-06-07 14:16:01 -04:00
LOLi
9f7b066bd9 Linux 4.9 compat: fix zfs_ctldir xattr handling
Since torvalds/linux@d0a5b99 IOP_XATTR is used to indicate the inode
has xattr support: clear it for the ctldir inodes to avoid EIO errors.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6189
2017-06-05 11:26:25 -07:00
LOLi
92aceb2a7e Fix "snapdev" property issues
When inheriting the "snapdev" property to we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').

Additionally, fix a NULL pointer dereference accidentally introduced in
5559ba0 that can be triggered when setting the "snapdev" property to
the value "hidden" twice.

Finally, add a new test case "zvol_misc_snapdev" to the ZFS Test Suite.

Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6131 
Closes #6175 
Closes #6176
2017-06-02 07:17:00 -07:00
Chunwei Chen
b870c4b5d7 Fix import wrong spare/l2 device when path change
If, for example, your aux device was /dev/sdc, but now the aux device is
removed and /dev/sdc points to other device. zpool import will still
use that device and corrupt it.

The problem is that the spa_validate_aux in spa_import, rather than
validate the on-disk label, it would actually write label to disk. We
remove them since spa_load_{spares,l2cache} seems to do everything we
need and they would actually validate on-disk label.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6158
2017-06-01 06:39:42 -07:00
LOLi
3f7d0418dc Fix memory leak in zvol_set_volsize()
Move kmem_free() so it's called for every error path: this is
preferred over making `dmu_object_info_t doi` local to accommodate
older kernels with limited stacks.

Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6177
2017-05-31 12:52:12 -07:00
Boris Protopopov
2d82116e80 Fix ida leak in zvol_create_minor_impl
Added missing ida_simple_remove() in the error handling path.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #6159 
Closes #6172
2017-05-26 17:50:25 -07:00
Alek P
9210e43a16 Don't dirty bpobj if it has no entries
In certain cases (dsl_scan_sync() is one), we may end up calling
bpobj_iterate() on an empty bpobj. Even though we don't end up
modifying the bpobj it still gets dirtied, causing unneeded writes
to the pool.

This patch adds an early bail from bpobj_iterate_impl() if bpobj
is empty to prevent unneeded writes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6164
2017-05-26 11:42:10 -07:00
Brian Behlendorf
261c013fbf Revert "Fix "snapdev" property inheritance behaviour"
This reverts commit 959f56b993.
An issue was uncovered by the new zvol_misc_snapdev test case
which needs to be investigated and resolved.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6174 
Issue #6131
2017-05-26 11:40:44 -07:00
Alan Somers
0071036526 OpenZFS 8070 - Add some ZFS comments
Authored by: Alan Somers <asomers@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: bunder2015 <omfgbunder@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/8070
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40713f2
Closes #6160
2017-05-25 17:15:06 -07:00
LOLi
959f56b993 Fix "snapdev" property inheritance behaviour
When inheriting the "snapdev" property to we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6131
2017-05-25 16:43:46 -07:00
Chunwei Chen
952e490b1b Improve gitignore
Ignore .*.d and exclude Makefile.in in module/
Also, ignore *.patch and *.orig files

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2017-05-25 10:14:13 -07:00
Chunwei Chen
3bda331ba8 Improve gitignore
Exclude Makefile.in in module/ and fix the gitignore in cmd/
Also, ignore *.patch and *.orig files

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2017-05-25 10:12:50 -07:00
Brian Behlendorf
2ded1c7eff Fix cv_timedwait timeout
Perform the already past expiration time check before updating
cvp->cv_mutex with the provided mutex.  This check only depends
on local state.  Doing it first ensures that cvp->cv_mutex will not
be updated in the timeout case or if it's ever called with an
expire_time <= now.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #616
2017-05-25 10:01:44 -07:00
LOLi
3e6c943347 Linux 4.12 compat: fix super_setup_bdi_name() call
Provide a format parameter to super_setup_bdi_name() so we don't
create duplicate names in '/devices/virtual/bdi' sysfs namespace which
would prevent us from mounting more than one ZFS filesystem at a time.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6147
2017-05-25 09:55:55 -07:00
Feng Sun
f871ab6ea2 Fix LZ4_uncompress_unknownOutputSize caused panic
Sync with kernel patches for lz4

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/lib/lz4

4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
d5e7ca LZ4 : fix the data abort issue
bea2b5 lib/lz4: Pull out constant tables
99b7e9 lz4: fix system halt at boot kernel on x86_64

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #5975 
Closes #5973
2017-05-19 13:45:46 -07:00
Alek P
bec1067d54 Implemented zpool sync command
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6122
2017-05-19 12:33:11 -07:00
Tony Hutter
4a283c7f77 Force fault a vdev with 'zpool offline -f'
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline.  Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED.  The -f faults
persist across imports, unless they were set with the temporary
(-t) flag.  Both persistent and temporary faults can be cleared
with zpool clear.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6094
2017-05-19 12:30:16 -07:00
Tom Caputi
a32df59e18 Fixed small memory leak in ereport handling
One pre-check in zfs_ereport_start() was being called after
the nvlists were being allocated. This simply corrects that
issue.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6140
2017-05-18 17:35:49 -07:00
Boris Protopopov
5559ba094f Introduce zv_state_lock
The lock is designed to protect internal state of zvol_state_t and
to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path)
while holding zvol_stat_lock. Refactor the code accordingly.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3484
Closes #6065
Closes #6134
2017-05-16 19:44:06 -04:00
Boris Protopopov
07783588bc Revert commit 1ee159f4
Fix lock order inversion with zvol_open() as it did not account
for use of zvols as vdevs. The latter use cases resulted in the
lock order inversion deadlocks that involved spa_namespace_lock
and bdev->bd_mutex.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6065
Issue #6134
2017-05-16 19:42:47 -04:00
Isaac Huang
3d6da72d18 Skip spurious resilver IO on raidz vdev
On a raidz vdev, a block that does not span all child vdevs, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.

This patch eliminates such spurious IOs.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5316
2017-05-12 17:28:03 -07:00
Matthew Ahrens
4747a7d3d4 OpenZFS 8063 - verify that we do not attempt to access inactive txg
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.

Porting Notes:
- ASSERTV added to txg_sync_waiting() for unused variable.

OpenZFS-issue: https://www.illumos.org/issues/8063
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
Closes #6109
2017-05-10 13:52:22 -04:00
Matthew Ahrens
335b251ac1 OpenZFS 8166 - zpool scrub thinks it repaired offline device
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged.  When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started.  The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.

The fix is to never clear the DTL of offline devices.  Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.

The problem can be worked around by running "zpool scrub" after
"zpool online".

OpenZFS-issue: https://www.illumos.org/issues/8166
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372
Closes #5806 
Closes #6103
2017-05-10 10:32:39 -07:00
Tom Caputi
f486f58440 Add missing arc_free_cksum() to arc_release()
The arc layer tracks checksums of its data in the arc header
so that it can ensure that buffers haven't changed when they're
not supposed to. This checksum is only maintained while there
is an uncompressed buffer still attached to the header.
Unfortunately there is a missing call to arc_free_cksum() in
arc_release() that can trigger ASSERTs. This has not been a
common issue because the checksums are only maintained for
debug builds and triggering the bug requires writing a block
(and therefore calling arc_release()) while a compressed buffer
is still being used on a debug build. This simply corrects the
issue.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6105
2017-05-10 10:25:27 -07:00
Brian Behlendorf
2946a1a15a Linux 4.12 compat: CURRENT_TIME removed
Linux 4.9 added current_time() as the preferred interface to get
the filesystem time.  CURRENT_TIME was retired in Linux 4.12.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6114
2017-05-10 09:30:48 -07:00
LOLi
a3eeab2de6 Add property overriding (-o|-x) to 'zfs receive'
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.

This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.

References:
   https://www.illumos.org/issues/2745
   https://www.illumos.org/issues/3753

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1350 
Closes #5349
2017-05-09 16:21:09 -07:00
Christian Schwarz
305bc4b370 Make createtxg and guid properties public
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.

One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.

Shell scripts are commonly used to auomate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.

By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use

  a) `createtxg` to determine the logical & temporal order of snapshots
     (the creation property is not an equivalent substitute since
      multiple snapshots may be created within one second)
  b) `guid` to uniquely identify a snapshot, independent of its current
      display name

This has the potential of making scripts safer and correct.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #6102
2017-05-09 15:36:53 -07:00
Chunwei Chen
e624cd1959 Linux 4.12 compat: PF_FSTRANS was removed
zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related
check, so we use them accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6113
2017-05-09 10:38:46 -07:00
Chunwei Chen
8f87971e1f Linux 4.12 compat: PF_FSTRANS was removed
Change SPL_FSTRANS to optionally contains PF_FSTRANS. Also, add
__spl_pf_fstrans_check for the checks specifically for PF_FSTRANS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #614
2017-05-09 10:36:54 -07:00
Brian Behlendorf
1eab430af7 Fix unused variable warning
Remove the lz4_ac local variable from dmu_write_policy() to resolve
the following unused variable warning on non-debug builds.

dmu.c: In function ‘dmu_write_policy’:
dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable]
  boolean_t lz4_ac = spa_feature_is_active(os->os_spa,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-05-05 10:23:58 -07:00
Gvozden Neskovic
c17486b217 Add missing *_destroy/*_fini calls
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5428
2017-05-04 19:26:28 -04:00
Brian Behlendorf
8fa5250f5d Default to zvol_request_async=0
Change the default ZVOL behavior so requests are handled asynchronously.
This behavior is functionally the same as in the zfs-0.6.4 release.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5902
2017-05-04 18:01:50 -04:00
Richard Yao
bc17f1047a Enable Linux read-ahead for a single page on ZVOLs
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.

Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
2017-05-04 18:00:27 -04:00
RageLtMan
5731140eaf Disable write merging on ZVOLs
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.

Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.

Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
2017-05-04 17:59:52 -04:00
LOLi
dddef7d600 More ashift improvements
This commit allow higher ashift values (up to 16) in 'zpool create'

The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability the safely roll back a pool to recover it.

Since b02fe35 the largest uberblock size we support is 8K: this
allow us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.

Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.

Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2024 
Closes #4205 
Closes #4740 
Closes #5763
2017-05-03 09:31:05 -07:00
Olaf Faaland
9d3f7b8791 Write label 2,3 uberblocks when vdev expands
When vdev_psize increases, the location of labels 2 and 3 changes
because their location is relative to the end of the device.

The configs for labels 2 and 3 are written during the next spa_sync()
because the vdev is added to the dirty config list.  However, the
uberblock rings are not re-written in their new location, leaving the
device vulnerable to the beginning of the device being overwritten or
damaged.

This patch copies the uberblock ring from label 0 to labels 2 and 3,
in their new locations, at the next sync after vdev_psize increases.

Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks
are copied.

Reviewed-by: BearBabyLiu <liu.huang@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5108
2017-05-02 13:55:24 -07:00
Debabrata Banerjee
03b60eee78 Allow scaling of arc in proportion to pagecache
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6035
2017-05-02 15:50:49 -04:00
Debabrata Banerjee
4149bf498a Correct signed operation
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:50:26 -04:00
Debabrata Banerjee
44813aefad Don't run the reaper if we didn't shrink the cache
Calling it when nothing is evictable will cause extra kswapd cpu. Also
if we didn't shrink it's unlikely to have memory to reap because we
likely just called it microseconds ago. The exception is if we are in
direct reclaim.

You can see how hard this is being hit in kswapd with a light test
workload:

  34.95%  [zfs]             [k] arc_kmem_reap_now
   5.40%  [spl]             [k] spl_kmem_cache_reap_now
   3.79%  [kernel]          [k] _raw_spin_lock
   2.86%  [spl]             [k] __spl_kmem_cache_generic_shrinker.isra.7
   2.70%  [kernel]          [k] shrink_slab.part.37
   1.93%  [kernel]          [k] isolate_lru_pages.isra.43
   1.55%  [kernel]          [k] __wake_up_bit
   1.20%  [kernel]          [k] super_cache_count
   1.20%  [kernel]          [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure
arc_kmem_reap_now still consumes excessive CPU:

  12.69%  [kernel]  [k] isolate_lru_pages.isra.43
  10.76%  [kernel]  [k] free_pcppages_bulk
   7.98%  [kernel]  [k] drop_buffers
   7.31%  [kernel]  [k] shrink_page_list
   6.44%  [zfs]     [k] arc_kmem_reap_now
   4.19%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.95%  [kernel]  [k] __isolate_lru_page
   3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache only workload as above with this patch series:

  11.58%  [kernel]  [k] isolate_lru_pages.isra.43
  11.20%  [kernel]  [k] drop_buffers
   9.67%  [kernel]  [k] free_pcppages_bulk
   8.44%  [kernel]  [k] shrink_page_list
   4.86%  [kernel]  [k] __isolate_lru_page
   4.43%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.44%  [kernel]  [k] __radix_tree_lookup

   (arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:50:13 -04:00
Debabrata Banerjee
1a31dcf53c Only wakeup waiters if we've actually done work
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:50:02 -04:00
Debabrata Banerjee
2e91c2fb1a Do not stop kernel shrinker on lock contention
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache such as is done by drop_caches

Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.

AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:49:48 -04:00
Debabrata Banerjee
b855550c33 Stop double reclaiming or not reclaiming at all
Move arcstat_need_free increment from all direct calls to when
arc_reclaim_lock is busy and we exit wihout doing anything. Data will
be reclaimed in reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:49:36 -04:00
Debabrata Banerjee
30fffb9021 Make arc_need_free updates atomic
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:48:49 -04:00
Debabrata Banerjee
9b50146dc4 Don't report ghost buffers as evictable mem
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
2017-05-02 15:47:23 -04:00
jxiong
2b91b5119c minor improvement to abd_free_pages()
It doesn't need to have a loop to free page in a single scatterlist
entry because it should be single or compound page. The pages can be
freed in one invocation to __free_pages() for both cases.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Closes #6057
2017-05-02 10:06:18 -07:00
jxiong
24fa20340d Guarantee PAGESIZE alignment for large zio buffers
In current implementation, only zio buffers in 16KB and bigger are
guaranteed PAGESIZE alignment. This breaks Lustre since it assumes
that 'arc_buf_t::b_data' must be page aligned when zio buffers are
greater than or equal to PAGESIZE.

This patch will make the zio buffers to be PAGESIZE aligned when
the sizes are not less than PAGESIZE.

This change may cause a little bit memory waste but that should be
fine because after ABD is introduced, zio buffers are used to hold
data temporarily and live in memory for a short while.

Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes #6084
2017-05-02 10:04:30 -07:00
Brian Behlendorf
7dae2c81e7 Linux 4.12 compat: super_setup_bdi_name()
All filesystems were converted to dynamically allocated BDIs.  The
destruction of backing_dev_info structures is handled as part of
super block destruction.  Refactor the code to abstract away the
details of creating and destroying a BDI.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6089
2017-05-02 09:46:18 -07:00
Yuri Pankov
153b228554 OpenZFS 7786 - zfs`vdev_online() needs better notification about state changes
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: bunder2015 <omfgbunder@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/7786
OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f
Closes #6074
2017-05-01 16:24:37 -04:00
Brian Behlendorf
e99932f7de Limit zfs_dirty_data_max_max to 4G
Reinstate default 4G zfs_dirty_data_max_max limit.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6072
Closes #6081
2017-05-01 13:01:39 -07:00
Chunwei Chen
692e55b8fe Reinstate zvol_taskq to fix aio on zvol
Commit 37f9dac removed the zvol_taskq for processing zvol requests.
This was removed as part of switching to make_request_fn and was
motivated by a concern at the time over dispatch latency.

However, this also made all bio request synchronous, and caused
serious performance issues as the bio submitter would wait for
every bio it submitted, effectively making the IO depth 1.

This patch reinstate zvol_taskq, and to make sure overlapped I/Os
are ordered properly, we take range lock in zvol_request, and pass
it along with bio to the I/O functions zvol_{write,discard,read}.

In order to facilitate benchmarks a zvol_request_sync module
option was added to switch between sync and async request handling.
For the moment, the default behavior is synchronous but this is
likely to change pending additional testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5824
2017-04-26 13:54:40 -07:00
Dan Kimmel
a7004725d0 OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: bunder2015 <omfgbunder@gmail.com>
Ported-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Most of 7252 was already picked up during ABD work.  This
  commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
  implemented for Linux which does not depend on fio.  The behavior
  of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.

OpenZFS-issue: https://www.illumos.org/issues/7252
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294
Closes #6067
2017-04-26 12:31:43 -07:00
wli5
7a25f0891e Change U16 to U32 due to atomic_inc_32_nv
After run a long time with QAT compression, the variable "inst_num"
is overflow by "atomic_inc_32_nv", which causes its neighbor
variable overwritten. Change its definition from U16 to U32.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6051
2017-04-25 17:41:58 -07:00
Matthew Ahrens
a004338372 OpenZFS 8025 - dbuf_read() creates unnecessary zio_root() for bonus buf
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.

The fix is to only create/destroy the zio_root() in dbuf_read() if the
blkptr is not NULL and not a hole.

Porting Notes:
- The error handling for when dbuf_read_impl() fails which was
  originally added in commit 5f6d0b6f5 has been preserved.

OpenZFS-issue: https://www.illumos.org/issues/8025
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8ec5c7c
Closes #6048
2017-04-24 10:44:19 -07:00
dbavatar
6e03ec4fa2 Fix lseek result when dnode is dirty
Fixup commit 66aca24.  We should have equivalent return
values as generic_file_llseek() and advance to end of file.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6050 
Closes #6053
2017-04-24 09:38:31 -07:00
Olaf Faaland
0091d66f4e Correct lock ASSERTs in vdev_label_read/write
The existing assertions in vdev_label_read() and vdev_label_write(),
testing which config locks are held, are incorrect. The assertions
test for locks which exceed what is required for safety.

Both vdev_label_{read,write}() are changed to assert SCL_STATE is held
as RW_READER or RW_WRITER. This is safe because:

Changes to the vdev tree occur under SCL_ALL as RW_WRITER, via
spa_vdev_enter() and spa_vdev_exit().

Changes to vdev state occur under SCL_STATE_ALL as RW_WRITER, via
spa_vdev_state_enter() and spa_vdev_state_exit().

Therefore, the new assertions guarantee that the vdev cannot change
out from under a zio, and I/O to a specified leaf vdev's label is
safe.

Furthermore, this is consistent with the SPA locking discussion in
spa_misc.c, "For any zio operation that takes an explicit vdev_t
argument ... zio_read_phys(), or zio_write_phys() ... SCL_STATE as
reader suffices."

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5983
2017-04-21 14:26:43 -07:00
DHE
06226b5936 Increase zfs_vdev_async_write_min_active to 2
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #4825
Closes #5926
2017-04-14 14:03:44 -07:00
Matthew Ahrens
f6d4ce8e34 OpenZFS 8061 - sa_find_idx_tab can be declared more type-safely
Authored by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

sa_find_idx_tab() is declared as taking and returning "void *" parameters.
These can be declared to be the specific types.

OpenZFS-issue: https://www.illumos.org/issues/8061
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e64aff
Closes #6017
2017-04-14 11:11:28 -07:00
Andriy Gapon
87a275d97a OpenZFS 6101 - attempt to lzc_create() a filesystem under a volume results in a panic
Authored by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

When querying ZPL properties verify that the objset is of type
DMU_OST_ZFS.

OpenZFS-issue: https://www.illumos.org/issues/6101
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce2243a
Closes #6015
2017-04-14 11:11:28 -07:00
Andriy Gapon
31b6bc74b9 OpenZFS 8026 - retire zfs_throttle_delay and zfs_throttle_resolution
Authored by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8026
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9b33e07
Closes #6014
2017-04-14 11:11:20 -07:00
Debabrata Banerjee
66aca24730 SEEK_HOLE should not block on txg_wait_synced()
Force flushing of txg's can be painfully slow when competing for disk
IO, since this is a process meant to execute asynchronously. Optimize
this path via allowing data/hole seeking if the file is clean, but if
dirty fall back to old logic. This is a compromise to disabling the
feature entirely.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #4306
Closes #5962
2017-04-13 10:51:20 -07:00
Brian Behlendorf
e550644f0c OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader project)
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- grub-2.02-beta2-422-gcad5cc0 includes support for large blocks.
- Commit 8aab121 allowed GZIP[1-9].
- Grub allows pools with multiple top-level vdevs.

OpenZFS-issue: https://www.illumos.org/issues/5120
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd
Closes #6007
2017-04-13 09:40:00 -07:00
Brian Behlendorf
00481e7dad OpenZFS 7503 - zfs-test should tail ::zfs_dbgmsg on test failure
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Enable internal log for DEBUG builds and in zfs-tests.sh.
- callbacks/zfs_dbgmsg.ksh - Dump interal log via kstat.
- callbacks/zfs_dmesg.ksh - Dump dmesg log.
- default.cfg - 'Test Suite Specific Commands' dropped.

OpenZFS-issue: https://www.illumos.org/issues/7503
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a1300
Closes #6002
2017-04-12 13:36:48 -07:00
Giuseppe Di Natale
17b43f96f9 Skip rate limiting events in zfs_ereport_post
In zfs_ereport_post, if an event is a rate limiting
event, immediately return before any processing is done.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #5998
2017-04-11 18:37:45 -07:00
LOLi
047187c1bd Fix size inflation in spa_get_worst_case_asize()
When we try assign a new transaction to a TXG we must know beforehand
if there is sufficient free space on disk. This is to decide,
in dmu_tx_assign(), if we should reject the TX with ENOSPC.

We rely on spa_get_worst_case_asize() to inflate the size of our
logical writes by a factor of spa_asize_inflation which is
calculated as:

   (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24

The problem with the current implementation is that we don't take
into account what happens with very small writes on VDEVs with large
physical block sizes.
Consider the case of writes to a dataset with recordsize=512,
copies=3 on a VDEV with ashift=13 (usually SSD with 8K block size):
every logical IO will end up allocating 3 * 8K = 24K on disk, so 512
bytes multiplied by 48, which is double the size we account for.
If we allow this kind of writes to be assigned a TX it is possible,
when the pool is almost full, to trigger an allocation failure
(ENOSPC) in the ZIO pipeline, which will in turn result in the whole
pool being suspended.

The bug is fixed by using, in spa_get_worst_case_asize(), the MAX()
value chosen between the logical io size from zfs_write() and the
maximum physical block size used among our VDEVs.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5941
2017-04-10 15:28:21 -07:00
Matthew Ahrens
8542ef852a OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Matt Ahrens <mahrens@delphix.com>

RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity.  Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4.  A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.

To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks.  The contents of these pad sectors are not used, so we do not
need to read or write these sectors.  However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors.  If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance.  The gap-filling performance improvement was introduced in
July 2009.

Writing the optional zio is done by the io aggregation code in
vdev_queue.c.  The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB.  For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.

The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks).  It
cannot occur with the default recordsize=128KB.  If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors).  Therefore
this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only
in rarely-used configurations:
* 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
* 4-wide RAIDZ2 (either recordsize, either ashift)
* 5-wide RAIDZ2 with recordsize=512KB (either ashift)
* 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)
* RAIDZ1 with 4 or 8 disks
* RAIDZ2 with 4, 8, or 10 disks
* RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)
* RAIDZ1 with 7 or 8 disks
* RAIDZ2 with 4, 5, or 10 disks
* RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing
kernel tunables), many configurations are affected, including with
higher numbers of disks (up to 18 disks with recordsize=2MB).

Increase zfs_vdev_aggregation_limit to allow the optional zio to be
aggregated, thus eliminating the problem.  Setting it to 256KB fixes all
commonly-used configurations.

The solution is to aggregate optional zio's regardless of the
aggregation size limit.

Analysis sponsored by Intel Corp.

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321
Closes #5931
2017-04-10 15:21:45 -07:00
Giuseppe Di Natale
42db43e982 OpenZFS 2932 - support crash dumps to raidz, etc. pools
Authored by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2932
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b
Closes #5984
Closes #5216
2017-04-10 10:24:17 -07:00
George Wilson
3b7f360c96 OpenZFS 8023 - Panic destroying a metaslab deferred range tree
Authored by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

We don't want to dirty any data when we're in the final txgs of the pool
export logic. This change introduces checks to make sure that no data is
dirtied after a certain point. It also addresses the culprit of this
specific bug – the space map cannot be upgraded when we're in final
stages of pool export. If we encounter a space map that wants to be
upgraded in this phase, then we simply ignore the request as it will get
retried the next time we set the fragmentation metric on that metaslab.

OpenZFS-issue: https://www.illumos.org/issues/8023
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ef00f5
Closes #5991
2017-04-09 16:12:35 -07:00
Andriy Gapon
c0c8cc7b43 OpenZFS 8027 - tighten up dsl_pool_dirty_delta
Authored by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8027
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/642668d
Closes #5988
2017-04-09 16:04:35 -07:00
Don Brady
177c91d06e Fix regression in zfs_ereport_start()
On 32-bit platforms spa_state is 32 bits without cast, and thus
caused a NULL pointer dereference when treated as 64bit in
var arg.  Accidentally introduced by bcdb96a.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #5966 
Closes #5965
2017-04-05 14:24:26 -07:00
Steven Hartland
2e215fecbe OpenZFS 7885 - zpool list can report 16.0e for expandsz
Authored by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>

When a member of a RAIDZ has been replaced with a device smaller than
the original, then the top level vdev can report its expand size as
16.0E.
The reduced child asize causes the RAIDZ to have a vdev_asize lower than
its vdev_max_asize which then results in an underflow during the
calculation of the parents expand size.
Fix this by updating the vdev_asize if it shrinks, which is already
protected by a check against vdev_min_asize so should always be safe.
Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize
is always greater than the parents vdev_min_size.

Reviewed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7885
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bb0dbaa
Closes #5963
2017-04-05 09:33:20 -07:00
N Clark
bcdb96a3e1 Additional Information for Zedlets
* Add ZPOOL pool state to zfs_post_common to
  allow differentiation between export and destroy
  by zedlets.

* Add pool name as standard export  This ensures
  pool name is exported to zedlets.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Closes #5942
2017-04-03 14:23:02 -07:00
Gvozden Neskovic
84c07adadb Remove dependency on linear ABD
Wherever possible it's best to avoid depending on a linear ABD.
Update the code accordingly in the following areas.

- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()

Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5668
2017-03-29 12:24:51 -07:00
LOLi
ff61d1a495 Check ashift validity in 'zpool add'
df83110 added the ability to specify a custom "ashift" value from the command
line in 'zpool add' and 'zpool attach'. This commit adds additional checks to
the provided ashift to prevent invalid values from being used, which could
result in disastrous consequences for the whole pool.

Additionally provide ASHIFT_MAX and ASHIFT_MIN definitions in spa.h.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5878
2017-03-28 17:21:11 -07:00
Chunwei Chen
12aec7dcd9 Fix wrong offset args in vdev_cache_write
The offset arguments is wrong when changing to abd_copy_off in a6255b7

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5932 
Closes #5936
2017-03-28 11:06:22 -07:00
Brian Behlendorf
8d70398740 Retry zfs_znode_alloc() in zfs_mknode()
For historical reasons zfs_mknode() was written such that it could
never fail.  This poses a problem for Linux since zfs_znode_alloc()
could potentually failure due to low memory.  Handle this gracefully
by retrying zfs_znode_alloc() until it succeeds, direct reclaim
will eventually be able to allocate memory.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5535 
Closes #5908
2017-03-23 18:26:50 -07:00
George Wilson
55922e73b4 OpenZFS 3821 - Race in rollback, zil close, and zil flush
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/3821
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/43297f9
Closes #5905
2017-03-23 18:20:58 -07:00
wli5
6a9d635998 GZIP compression offloading with QAT accelerator
This patch implement the hardware accelerator method in GZIP compression
in ZFS. When the ZFS pool is enabled GZIP compression, the compression
API will be automatically transferred to the hardware accelerator to
free up CPU resource and speed up the compression time.

* To enable Intel QAT hardware acceleration in ZOL you need to have QAT
  hardware and the driver installed:
  * QAT hardware DH8950:
  http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950
  * QAT driver:
  https://01.org/intel-quickassist-technology
* Start QAT driver in your system:
  service qat_service start
* Enable QAT in ZFS, e.g.:
  ./configure --with-qat=<qat-driver-path>/QAT1.6
  make
* Set GZIP compression in ZFS dataset:
  zfs set compression = gzip <dataset>
* Get QAT hardware statistics by:
  cat /proc/spl/kstat/zfs/qat
* To disable QAT in ZFS:
  insmod zfs.ko zfs_qat_disable=1

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #5846
2017-03-22 17:58:47 -07:00
Matthew Ahrens
64fc776208 OpenZFS 7968 - multi-threaded spa_sync()
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

spa_sync() iterates over all the dirty dnodes and processes each of them
by calling dnode_sync(). If there are many dirty dnodes (e.g. because we
created or removed a lot of files), the single thread of spa_sync()
calling dnode_sync() can become a bottleneck. Additionally, if many
dnodes are dirtied concurrently in open context (e.g. due to concurrent
file creation), the os_lock will experience lock contention via
dnode_setdirty().

The solution is to track dirty dnodes on a multilist_t, and for
spa_sync() to use separate threads to process each of the sublists in
the multilist.

OpenZFS-issue: https://www.illumos.org/issues/7968
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4a2a54c
Closes #5752
2017-03-20 18:36:00 -07:00
Olaf Faaland
a3478c0747 Linux 4.11 compat: iops.getattr and friends
In torvalds/linux@a528d35, there are changes to the getattr family of functions,
struct kstat, and the interface of inode_operations .getattr.

The inode_operations .getattr and simple_getattr() interface changed to:

int (*getattr) (const struct path *, struct dentry *, struct kstat *,
    u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to use.
Fields the caller has not specified via request_mask may be set in the returned
struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update
the attributes from the backing store.

Currently both fields are ignored.  It is possible that getattr-related
functions within zfs could be optimized based on the request_mask.

struct kstat includes new fields:
u32               result_mask;  /* What fields the user got */
u64               attributes;   /* See STATX_ATTR_* flags */
struct timespec   btime;        /* File creation time */

Fields attribute and btime are cleared; the result_mask reflects this.  These
appear to be optional based on simple_getattr() and vfs_getattr() within the
kernel, which take the same approach.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5875
2017-03-20 17:51:16 -07:00
Olaf Faaland
bf8abea4da Linux 4.11 compat: remove stub for __put_task_struct
Before kernel 2.6.29 credentials were embedded in task_structs, and zfs had
cases where one thread would need to refer to the credential of another thread,
forcing it to take a hold on the foreign thread's task_struct to ensure it was
not freed.

Since 2.6.29, the credential has been moved out of the task_struct into a
cred_t.

In addition, the mainline kernel originally did not export __put_task_struct()
but the RHEL5 kernel did, according to zfsonlinux/spl@e811949a57.  As of
2.6.39 the mainline kernel exports it.

There is no longer zfs code that takes or releases holds on a task_struct, and
so there is no longer any reference to __put_task_struct().

This affects the linux 4.11 kernel because the prototype for
__put_task_struct() is in a new include file (linux/sched/task.h) and so the
config check failed to detect the exported symbol.

Removing the unnecessary stub and corresponding config check.  This works on
kernels since the oldest one currently supported, 2.6.32 as shipped with
Centos/RHEL.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:45 -07:00
Olaf Faaland
94b1ab2ae0 Linux 4.11 compat: vfs_getattr() takes 4 args
There are changes to vfs_getattr() in torvalds/linux@a528d35.  The new
interface is:

int vfs_getattr(const struct path *path, struct kstat *stat,
               u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to
use.  Fields the caller does not specify via request_mask may be set in
the returned struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update
the attributes from the backing store.

This patch uses the query_flags which result in vfs_getattr behaving the same
as it did with the 2-argument version which the kernel provided before
Linux 4.11.

Members blksize and blocks are now always the same size regardless of
arch.  They match the size of the equivalent members in vnode_t.

The configure checks are modified to ensure that the appropriate
vfs_getattr() interface is used.

A more complete fix, removing the ZFS dependency on vfs_getattr()
entirely, is deferred as it is a much larger project.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:39 -07:00
Matthew Ahrens
9522bd2429 OpenZFS 7801 - add more by-dnode routines (lint)
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7801
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f25efb3
Closes #5894
2017-03-20 12:33:17 -07:00
Matthew Ahrens
8614ddf9b4 OpenZFS 6874 - rollback and receive need to reset ZPL state to what's on disk
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the
ZPL doesn't completely reload the state from the DMU; some values remain
cached in the zfsvfs_t.

OpenZFS-issue: https://www.illumos.org/issues/6874
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fdcbd0
Closes #5888
2017-03-13 17:42:42 -07:00
Brian Behlendorf
1c2555ef92 Restructure mount option handling
Restructure the handling of mount options to be consistent with
upstream OpenZFS.  This required making the following changes.

- The zfs_mntopts_t was renamed vfs_t and adjusted to provide
  the minimal needed functionality.  This includes a pointer
  back to the associated zfsvfs_t.  Plus it made it possible
  to revert zfs_register_callbacks() and zfsvfs_create() back
  to their original prototypes.

- A zfs_mnt_t structure was added for the sole purpose of
  providing a structure to pass the osname and raw mount
  pointer to zfs_domount() without having to copy them.

- Mount option parsing was moved down from the zpl_* wrapper
  functions in to the zfs_* functions.  This allowed for the
  code to be simplied and it's where similar functionality
  appears on other platforms.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:41 -08:00
Brian Behlendorf
f298b24ddf Rename zfs_* functions
Several functions were renamed when ZFS was originally ported to
Linux.  Revert the code to the original names to minimize the
delta with upstream OpenZFS.

  zfs_sb_teardown -> zfsvfs_teardown
  zfs_sb_create -> zfsvfs_create
  zfs_sb_setup -> zfsvfs_setup
  zfs_sb_free -> zfsvfs_free
  get_zfs_sb -> getzfsvfs
  zfs_sb_hold -> zfsvfs_hold
  zfs_sb_rele -> zfsvfs_rele

  zfs_sb_prune_aliases  -> zfs_prune_aliases (Linux-only)
  zfs_sb_prune -> zfs_prune (Linux only)

Align the zfs_vnops.h and zfs_vfsops.h with upstream as much
as possible.  Several prototypes were removed and those that
remain were reordered.

Move the EXPORT_SYMBOL lines to the end of the source files
for consistency with the other source files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:35 -08:00
Brian Behlendorf
0037b49e83 Rename zfs_sb_t -> zfsvfs_t
The use of zfs_sb_t instead of zfsvfs_t results in unnecessary
conflicts with the upstream source.  Change all instances of
zfs_sb_t to zfsvfs_t including updating the variables names.

Whenever possible the code was updated to be consistent with
hope it appears in the upstream OpenZFS source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:33 -08:00
Brian Behlendorf
ef1bdf363c Fix ZVOL BLKFLSBUF ioctl
The BLKFLSBUF ioctl is expected to do two things:

  - flush dirty pages to stable storage, and
  - invalidate clean pages

Unfortunately, the existing implementation of BLKFLSBUF in
zvol_ioctl() only flushes pages which are part of the current
TXG to disk.  There may be additional dirty pages in the
page cache which haven't yet been submitted to the DMU and
therefore aren't part of any TXG.

Furthermore because zvol_ioctl() returns 0 the generic
blkdev_flushbuf() does not invalidate the page cache.

Resolve the issue by moving bdev_flush() in to zvol_ioctl()
and explicitly waiting for a full TXG sync.  Then invalidate
the page cache.  The associated ARC buffers need not be
evicted since they cannot be bypassed using O_DIRECT.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5871 
Closes #5879
2017-03-09 17:43:36 -08:00
Giuseppe Di Natale
589bb918ef Suppress cppcheck nullPointer error in zfs_write
Newer versions of cppcheck find the potential NULL pointer
bug in zfs_write(). The function is difficult to refactor without
extensive work, so suppress the potential NULL pointer error
which cannot occur for now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #5882
2017-03-09 17:40:21 -08:00
Chunwei Chen
9b77d1c958 Fix nfs snapdir automount
The current implementation for allowing nfs to access snapdir is very buggy.
It uses a special fh for snapdirs, such that the next time nfsd does
fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd
never knows it cross a mountpoint.

The problem is that nfsd will not hold a reference on the vfsmount of the
snapshot. This cause auto unmounter to unmount the snapshot even though nfs is
still holding dentries in it.

To fix this, we return the inode for the snapdirs themselves. However, we also
trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate
and see the mountpoint and do crossmnt.

Because nfsd will now be aware that these are different filesystems users
must add crossmnt to their export options to access snapshot directories.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3794
Closes #4716
Closes #5810 
Closes #5833
2017-03-08 09:26:33 -08:00
Andriy Gapon
423e7b6261 OpenZFS 7867 - ARC space accounting leak
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7867
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aa1f740d
Closes #5874
2017-03-07 18:14:32 -08:00
Gvozden Neskovic
650383f283 [icp] fpu and asm cleanup for linux
Properly annotate functions and data section so that objtool does not complain
when CONFIG_STACK_VALIDATION and CONFIG_FRAME_POINTER are enabled.

Pass KERNELCPPFLAGS to assembler.

Use kfpu_begin()/kfpu_end() to protect SIMD regions in Linux kernel.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5872 
Closes #5041
2017-03-07 12:59:31 -08:00
Brian Behlendorf
3ec3bc2167 OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.

The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).

This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().

The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.

We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).

OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804 

Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(),
  Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts these changes have not yet been
  applied to OpenZFS.
2017-03-07 09:51:59 -08:00
Brian Behlendorf
e2fcb56275 OpenZFS 7843 - get_clones_stat() is suboptimal for lots of clones
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7843
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4d519e7
Closes #5868
2017-03-07 09:47:40 -08:00
Chunwei Chen
7a789346af Fix loop device becomes read-only
Commit 933ec99 removes read and write from f_op because the vfs layer will
select iter_write or aio_write automatically. However, for Linux <= 4.0,
loop_set_fd will actually check f_op->write and set read-only if not exists.
This patch add them back and use the generic do_sync_{read,write} for
aio_{read,write} and new_sync_{read,write} for {read,write}_iter.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5776 
Closes #5855
2017-03-06 09:20:20 -08:00
Olaf Faaland
4859fe796c Linux 4.11 compat: avoid refcount_t name conflict
Linux 4.11 introduces a new type, refcount_t, which conflicts with the
type of the same name defined within ZFS.

Rename the ZFS type zfs_refcount_t.  Within the ZFS code, use a macro to
cause references to refcount_t to be changed to zfs_refcount_t at
compile time.  This reduces conflicts when later landing OpenZFS
patches.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5823 
Closes #5842
2017-02-28 16:10:18 -08:00
Matthew Ahrens
66eead53c9 Clean up by-dnode code in dmu_tx.c
0eef1bde31
introduced some changes which we slightly improved the style of when
porting to illumos.

There is also one minor error-handling fix, in zap_add() the "zap" may
become NULL in case of an error re-opening the ZAP.

Originally suggested at: https://github.com/openzfs/openzfs/pull/276

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5805
2017-02-24 13:34:26 -08:00
Isaac Huang
f7e76821c5 ABD style cleanups
The commit a6255b7fce removed a few
assertions which help catch errors and improve code readability. It also
duplicated two conditionals, which was unnecessary and made the code
confusing to read. This patch cleans it up.

Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5802
2017-02-24 12:05:42 -08:00
Daniel Hoffman
9e2c3bb4b9 OpenZFS 7812 - Remove gender specific language
Authored by: Daniel Hoffman <dj.hoffman@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.

OpenZFS-issue: https://www.illumos.org/issues/7812
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ad626db
Closes #5822
2017-02-24 11:07:04 -08:00
Andriy Gapon
0efd97912a OpenZFS 7199 - dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7199
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bfaed0b
Closes #5817
2017-02-24 11:05:33 -08:00
Isaac Huang
6d82f98c3d Fix incorrect spare vdev state after replacing
After a hot spare replaces an OFFLINE vdev, the new
parent spare vdev state is set incorrectly to OFFLINE.
The correct state should be DEGRADED. The incorrect
OFFLINE state will prevent top-level vdev from reading
the spare vdev, thus causing unnecessary reconstruction.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5766 
Closes #5770
2017-02-23 10:32:15 -08:00
Olaf Faaland
8d5feecacf Linux 4.11 compat: set_task_state() removed
Replace uses of set_task_state(current, STATE) with
set_current_state(STATE).

In Linux 4.11, torvalds/linux@642fa44, set_task_state() is removed.

All spl uses are of the form set_task_state(current, STATE).
set_current_state(STATE) is equivalent and has been available since
Linux 2.2.26.

Furthermore, set_current_state(STATE) is already used in about 15
locations within spl.  This change should have no impact other than
removing an unnecessary dependency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #603
2017-02-23 09:52:08 -08:00
Matthew Ahrens
c30e58c462 zfs_arc_num_sublists_per_state should be common to all multilists
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.  This tuning may be overridden
on a per-multilist basis.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5764
2017-02-15 15:49:33 -08:00
Tim Chase
544596c59e Fix zfs_compressed_arc_enabled parameter description
A likely cut/paste error caused the description to be applied to
zfs_arc_average_blocksize.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5788
2017-02-13 10:59:05 -08:00
Chunwei Chen
d6df043c53 Fix off by one in zpl_lookup
Doing the following command would return success with zfs creating an orphan
object.

	touch $(for i in $(seq 256); do printf "n"; done)

The funny thing is that this will only work once for each directory, because
after upgraded to fzap, zfs_lookup would fail properly since it has additional
length check.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5768
2017-02-11 12:42:17 -08:00
Matthew Ahrens
d7958b4cda OpenZFS 7104 - increase indirect block size
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7104
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4b5c8e9
Closes #5679
2017-02-09 10:27:02 -08:00
Matthew Ahrens
df7eeccc75 panic in bpobj_space(): null pointer dereference
This is a race condition in the deadlist code.

A thread executing an administrative command that uses
dsl_deadlist_space_range() holds the lock of the whole deadlist_t to
protect the access of all its entries that the deadlist contains in an
avl tree.

Sync threads trying to insert a new entry in the deadlist (through
dsl_deadlist_insert() -> dle_enqueue()) do not hold the deadlist lock at
that moment.  If the dle_bpobj is the empty bpobj (our sentinel value),
we close and reopen it.  Between these two operations, it is possible
for the dsl_deadlist_space_range() thread to dereference that bpobj
which is NULL during that window.

Threads should hold the a deadlist's dl_lock when they manipulate its
internal data so scenarios like the one above are avoided.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5762
2017-02-09 10:19:12 -08:00
Brian Behlendorf
ea7e86d8db Fix iput() calls within a tx
As explicitly stated in section 2 of the 'Programming rules'
comments at the top of zfs_vnops.c.

  If you must call iput() within a tx then use zfs_iput_async().

Move iput() calls after dmu_tx_commit() / dmu_tx_abort when
possible.  When not possible convert the iput() calls to
zfs_iput_async().

Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5758
2017-02-08 17:28:22 -08:00
Matthew Ahrens
4a5d7f8267 Allow c99 code to compile
Add the appropriate compiler flags to accept c99 code.  This will help to
minimize differences with upstream, and aid porting changes.  One change was
necessary in zvol.c because the DEFINE_IDA() macro does not work with the new
compiler flags.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5756
2017-02-08 09:27:48 -08:00
George Melikov
23d70cdef1 OpenZFS 6931 - lib/libzfs: cleanup gcc warnings
Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6931
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/88f61de
Closes #5741
2017-02-07 14:02:27 -08:00
David Quigley
bef78122e6 Add missing module_param for zfs_per_txg_dirty_frees_percent
When the code was added this tunable was not exposed via module params. Also it
was not documented. This patch changes the type from a uint32 to a ulong as
done with other percentage tunables and also documents it in the
zfs-module-parameters man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #5750
2017-02-07 09:44:03 -08:00
George Melikov
298ec40b6d OpenZFS 7448 - ZFS doesn't notice when disk vdevs have no write cache
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7448
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/295438b
Closes #5737
2017-02-04 09:23:50 -08:00
George Melikov
0a252daed3 OpenZFS 7504 - kmem_reap hangs spa_sync and administrative tasks
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7504
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/405a5a0
Closes #5736
2017-02-04 09:21:25 -08:00
George Melikov
2e0e443ac4 OpenZFS 7247 - zfs receive of deduplicated stream fails
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7247
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ad25b4
Closes #5689 

Porting notes:
- tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_013_pos.ksh
  renamed as zfs_receive_015_pos.ksh, zfs_receive_013_pos.ksh is now
  used for OpenZFS test.
- libzfs_sendrecv.c: SMALLEST_POSSIBLE_MAX_DDT_MB is always used
  for all 32-bit builds.
2017-02-04 09:10:24 -08:00
George Melikov
9b7b9cd370 OpenZFS 1300 - filename normalization doesn't work for removes
Authored by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/1300
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d
Closes #5725 

Porting notes:
- zap_micro.c: all `MT_EXACT` are replaced by `0`
2017-02-02 14:13:41 -08:00
Chunwei Chen
c7af63d62a Fix write(2) returns zero bug from 933ec99
For generic_write_checks with 2 args, we can exit when it returns zero because
it means count is zero. However this is not the case for generic_write_checks
with 4 args, where zero means no error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5720 
Closes #5726
2017-02-02 09:43:42 -08:00
Giuseppe Di Natale
fc386db191 Remove lint ifdef checks in zdb and dbuf
This is effectively dead code for the Linux implementation which can
be removed to improve readability.  We want to linter to check the
real production/debug build as much as possible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #5722
2017-02-01 16:47:04 -08:00
George Melikov
0f676dc228 OpenZFS 7072 - zfs fails to expand if lun added when os is in shutdown state
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7072
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c39a2aa
Closes #5694 

Porting notes:
- vdev.c: 'vdev_get_stats' changes are moved to 'vdev_get_stats_ex'.
- vdev_disk.c: ignored, Linux specific code is different.
2017-02-01 13:14:02 -08:00
David Quigley
2fe36b0bfb Use fletcher_4 routines natively with abd_iterate_func()
This patch adds the necessary infrastructure for ABD to make use
of the vectorized fletcher 4 routines.

- export ABD compatible interface from fletcher_4
- add ABD fletcher_4 tests for data and metadata ABD types.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #5589
2017-02-01 09:34:22 -08:00
George Melikov
539d33c791 OpenZFS 6569 - large file delete can starve out write ops
Authored by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/6569
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1bf4b6f2
Closes #5706
2017-01-31 14:44:03 -08:00
Giuseppe Di Natale
d69a321e56 OpenZFS 7545 - zdb should disable reference tracking
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes: Moved reference_tracking_enable and
reference_history outside of ZFS_DEBUG.

OpenZFS-issue: https://www.illumos.org/issues/7545
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4dd77f9
Closes #5701
2017-01-31 14:36:35 -08:00
Tim Chase
b81a3ddc32 Update deadman operation to better align with upstream OpenZFS
The deadman in ZoL didn't behave quite as it did in upstream
OpenZFS.  In addition to the 2 purposes for which OpenZFS used the
zfs_deadman_synctime_ms parameter, ZoL also used it to determine how
frequently the deadman would fire once it has been triggered.

This patch adds the zfs_deadman_checktime_ms parameter to control how
frequently the subsequent checks are performed.

The deadman is now disabled for suspended pools.

As had been the case, unlike upstream OpenZFS, ZoL will not panic when
a hung IO is detected.

The module parameter documentation has been upated to include the new
parameter and to better describe the operation of the deadmen.
    
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5695
2017-01-31 14:19:08 -08:00
Chunwei Chen
97048200f8 Use kernel slab for vn_cache and vn_file_cache
Resolve a false positive in the kmemleak checker by shifting to the
kernel slab.  It shows up because vn_file_cache is using KMC_KMEM
which is directly allocated using __get_free_pages, which is not
automatically tracked by kmemleak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #599
2017-01-31 13:44:01 -08:00
George Melikov
005e27e3b3 OpenZFS 7019 - zfsdev_ioctl skips secpolicy when FKIOCTL is set
Authored by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7019
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45b1747
Closes #5709
2017-01-31 10:24:23 -08:00
George Melikov
6325e48f95 OpenZFS 7136 - ESC_VDEV_REMOVE_AUX ought to always include vdev information
Authored by: Alan Somers <asomers@gmail.com>
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7136
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b72b6bb
Closes #5691 

Porting notes:
- Functionally this patch behaves the same as the OpenZFS
  version but it was adapted because because ZoL doesn't
  have the same illumos sysevent_t infrastructure and functionality.
2017-01-31 10:19:36 -08:00
George Melikov
41425f79da OpenZFS 7490 - real checksum errors are silenced when zinject is on
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7490
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6cedfc3
Closes #5693
2017-01-30 17:12:58 -08:00
George Melikov
e2da829cc1 OpenZFS 6922 - Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device
Authored by: Alan Somers <asomers@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6922
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/63364b0
Closes #5690
2017-01-30 15:33:46 -08:00
George Melikov
fa603f8233 OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's
Porting notes:
- 'zfs_dbgmsg_print()' reintroduced to userspace.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7277
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f
Closes #5684
2017-01-28 12:16:43 -08:00
Brian Behlendorf
a32494d22a Fix suspend Godfather I/Os io_reexecute bits
After resuming a pool the godfather zio could have both the
ZIO_REEXECUTE_NOW and ZIO_REEXECUTE_SUSPEND bits set.  This
can occur if some child zios set ZIO_REEXECUTE_NOW while
other set ZIO_REEXECUTE_SUSPEND.  The godfather zio can
inherit both flags in zio_notify_parent().

The child zios which assigned the ZIO_REEXECUTE_SUSPEND flag
will be removed from the godfather's child list and added to
the spa->spa_suspend_zio_root child list.   While child zios
with the ZIO_REEXECUTE_NOW bit set remain being monitored
by the godfather zio.

When the godfather zio executes zio_done() the presence of
the ZIO_REEXECUTE_SUSPEND bit results in all io_reexecute
being cleared.  These child zios will then not be re-executed
and instead will be destroyed and lost.

The most straight forward way to address this situation is
to only clear the ZIO_REEXECUTE_SUSPEND bit and leave the
ZIO_REEXECUTE_NOW bit set.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: yuxiang <guo.yong33@zte.com.cn>
2017-01-28 12:13:34 -08:00
George Melikov
721ed0ee86 OpenZFS 7580 - ztest failure in dbuf_read_impl
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3105d95
Closes #5678
2017-01-28 12:11:09 -08:00
George Melikov
a08abc1bb3 OpenZFS 7301 - zpool export -f should be able to interrupt file freeing
Authored by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7301
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb72182
Closes #5680
2017-01-27 11:46:39 -08:00
George Melikov
cc9bb3e58e OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7254
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c166b69
Closes #5670
2017-01-27 11:43:42 -08:00
Chunwei Chen
933ec99951 Retire .write/.read file operations
The .write/.read file operations callbacks can be retired since
support for .read_iter/.write_iter and .aio_read/.aio_write has
been added.  The vfs_write()/vfs_read() entry functions will
select the correct interface for the kernel.  This is desirable
because all VFS write/read operations now rely on common code.

This change also add the generic write checks to make sure that
ulimits are enforced correctly on write.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5587 
Closes #5673
2017-01-27 10:43:39 -08:00
Brian Behlendorf
986dd8aacc OpenZFS 5561 - support root pools on EFI/GPT partitioned disks
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5561
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1a902ef
Closes #5672
2017-01-27 10:40:02 -08:00
Tim Chase
258553d3d7 OpenZFS 7613 - ms_freetree[4] is only used in syncing context
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We
should replace it with two trees: the freeing tree (ranges that we are
freeing this syncing txg) and the freed tree (ranges which have been
freed this txg).

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7613
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8698da2
Closes #5598
2017-01-26 15:27:19 -08:00
George Melikov
9c9531cb6f OpenZFS 7500 - Simplify dbuf_free_range by removing dn_unlisted_l0_blkid
Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7500
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1b
Closes #5639
2017-01-26 15:15:48 -08:00
George Melikov
39efbde7c5 OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change
Authored by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6676
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40510e8
Closes #5667
2017-01-26 14:43:28 -08:00
George Melikov
aeacdefedc OpenZFS 7386 - zfs get does not work properly with bookmarks
Authored by: Marcel Telka <marcel@telka.sk>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7386
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a
Closes #5666
2017-01-26 14:42:15 -08:00
George Melikov
1149ba6478 OpenZFS 7606 - dmu_objset_find_dp() takes a long time while importing pool
When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().

There are several ways to improve performance here:

    1. We don't really need to do spa_check_logs() or
       spa_ld_claim_log_blocks() if the pool was closed cleanly.

    2. spa_load_verify() uses dmu_objset_find_dp() to check that no
       datasets have too long of names.

    3. dmu_objset_find_dp() is slow because it's doing
       zap_value_search() (which is O(N sibling datasets)) to determine
       the name of each dsl_dir when it's opened. In this case we
       actually know the name when we are opening it, so we can provide
       it and avoid the lookup.

This change implements fix #3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7606
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cac6bab
Closes #5662
2017-01-26 12:46:02 -08:00
George Melikov
ec923db25c OpenZFS 7180 - potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7180
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/690041b
Closes #5627
2017-01-23 10:53:46 -08:00
George Melikov
cffd6e1167 OpenZFS 3746 - ZRLs are racy
Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/3746
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/260af64
Closes #5625
2017-01-23 10:35:58 -08:00
George Melikov
911c41af2d OpenZFS 7304 - zfs filesystem/snapshot counts should be read-only
Authored by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7304
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/007a6c1
Closes #5624
2017-01-23 10:17:35 -08:00
George Melikov
f85c06bedf OpenZFS 7054 - dmu_tx_hold_t should use refcount_t to track space
Authored by: Igor Kozhukhov ikozhukhov@gmail.com
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/7054
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c779ad
Closes #5600
2017-01-23 09:36:24 -08:00
George Melikov
4ea3f86426 codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
clefru
2d4d81c485 Reimplement rt_mutex_owner to fix build with DEBUG & PREEMPT_RT_FULL
rt_mutex_owner is internal to kernel/locking/rtmutex_common.h and
inaccessible for SPL via the public kernel headers. The way of
accessing the owner has been stable since at least 3.13 ([1], [2]),
which is masking the lowest bit in the owner pointer in rt_mutex. We
do the same.

[1] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=3.13#L99
[2] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=4.9#L78

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes #593
2017-01-19 14:41:38 -08:00
George Melikov
5cb44271b4 Remove identical if statements in module/spl/spl-vnode.c
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #594
2017-01-19 14:32:45 -08:00
Chunwei Chen
040dab9939 Suspend/resume zvol for recv and rollback
When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be
called to swap out the ds_objset and do dmu_objset_evict on the old one.
However, currently zv->zv_objset will not be swapped out accordingly, so
if anyone currently holds a fd on the zvol, we risk hitting a use-after-free.

We fix this by introducing the suspend and resume mechanism of zsb to
zv.  Before recv or rollback, we use zvol_suspend to block all access to
zv_objset and shut it down. After the recv or rollback, we use zvol_resume
to swap in zv_objset with the new ds_objset and unblock the access.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4866 
Closes #5609
2017-01-19 13:56:36 -08:00
George Melikov
76fe529b39 OpenZFS 6529 - Properly handle updates of variably-sized SA entries
Porting notes:
- This issue was first fixed in ZoL by commit d862cb0d.  That fix was
then modified and an equivalent version of the patch landed in the
upstream code base.  For additional details see the discussion in
https://github.com/openzfs/openzfs/pull/24 .

This commit aligns ZoL with OpenZFS codebase.

Authored by: Andriy Gapon <avg@icyb.net.ua>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/6529
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e7e978b
Closes #5606
2017-01-19 13:50:22 -08:00
George Melikov
34a6b42844 OpenZFS 7659 - Missing thread_exit() in dmu_send.c
Two threads send_traverse_thread() and receive_writer_thread() should
end with thread_exit();

Mostly a cosmetic issue under IllumOS.

Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7659
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a569268
Closes #5603
2017-01-18 15:10:35 -08:00
George Melikov
7330fc57b7 OpenZFS 7235 - remove unused func dsl_dataset_set_blkptr
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bd56f80
Closes #5604
2017-01-17 15:22:56 -08:00
George Melikov
61ca48ff38 OpenZFS 7256 - low probability race in zfs_get_data
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7256
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6ed18a8
Closes #5601
2017-01-17 15:18:59 -08:00
George Melikov
e88551d52f OpenZFS 7071 - lzc_snapshot does not fill in errlist on ENOENT
Authored by: Igor Kozhukhov ikozhukhov@gmail.com
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7071
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/25f7d99
Closes #5597
2017-01-17 14:52:17 -08:00
George Melikov
cf7d1484bf OpenZFS 7082 - bptree_iterate() passes wrong args to zfs_dbgmsg()
Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7082
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/10e67aa
Closes #5596
2017-01-17 14:49:24 -08:00
Brian Behlendorf
832805d951 OpenZFS 6586 - Whitespace inconsistencies in the spa feature dependency arrays in zfeature_common.c
Porting Notes:
- Preserved 'static const spa_feature_t hole_birth_deps[]'.

Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6586
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/22b6687
Closes #5592
2017-01-17 14:46:28 -08:00
Kevin Tanguy
0194e4a03c Add support for recent kmem_cache_create_usercopy
SLAB_USERCOPY flag was used to indicate PAX
not to kill copies from kernel to userland.

With recent grsecurity patchset and
CONFIG_GRKERNSEC_HIDESYM that enables
CONFIG_PAX_USERCOPY zfs would panic.

Handle newer API while keeping old one functional.

Tested-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: spendergrsec <spender@grsecurity.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kevin Tanguy <kevin.tanguy@ovh.net>
Closes #595
2017-01-17 12:05:14 -08:00
LOLi
08f0510d87 Fix unallocated object detection for large_dnode datasets
Fix dmu_object_next() to correctly handle unallocated objects on
large_dnode datasets.

We implement this by scanning the dnode block until we find the correct
offset to be used in dnode_next_offset(). This is necessary because we
can't assume *objectp is a hole even if dmu_object_info() returns
ENOENT.

This fixes a couple of issues with zfs receive on large_dnode datasets.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5027 
Closes #5532
2017-01-13 15:47:34 -08:00
Brian Behlendorf
5043684ae5 OpenZFS 7603 - xuio_stat_wbuf_* should be declared (void)
Porting Notes:
- include/sys/dmu.h prototypes were already updated in 0bc8fd7

Authored by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/99aa8b5
Closes #5586
2017-01-13 15:33:14 -08:00
Brian Behlendorf
9775e98844 OpenZFS 7181 - race between zfs_mount and zfs_ioc_rollback
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7181
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90f2c09
Closes #5585
2017-01-13 15:29:32 -08:00
Jörg Thalheim
e254c8d8ee module/Makefile.in: use relative cp
Assuming /bin/cp causes problems on systems where cp is
not in /bin such as NixOS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Closes #5548
2017-01-13 15:18:34 -08:00
bzzz77
0eef1bde31 Add *_by-dnode routines
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by 
(objset_t *, uint64_t object).  This change converts some but
not all of the existing consumers.  As performance-sensitive
code paths are discovered they should be converted to use
these routines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes #5534 
Issue #4802
2017-01-13 14:58:41 -08:00
RageLtMan
120faefed9 Update struct member intializers to C89
When building SPL within the kernel tree, C99 initializers cause
build failures and need to be converted to C89 as kernel CFLAGS
specify -std=gnu89.

This fix was provided by @behlendorf in #595 discussion notes and
manually implemented in the current master revision.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Closes #597
2017-01-13 14:12:42 -08:00
Don Brady
38640550f2 OpenZFS 7743 - per-vdev-zaps init path for upgrade
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Joe Stein <jas14@cs.brown.edu>
Ported-by: Don Brady <don.brady@intel.com>

When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.

This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.

The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.

OpenZFS-issue: https://www.illumos.org/issues/7743
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0
Closes #5582
2017-01-13 13:50:22 -08:00
George Melikov
0bc63d83f6 OpenZFS 6603 - zfeature_register() should verify ZFEATURE_FLAG_PER_DATASET implies SPA_FEATURE_EXTENSIBLE_DATASET
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0803e91
Closes #5573
2017-01-12 11:58:04 -08:00
Don Brady
4e21fd060a OpenZFS 7303 - dynamic metaslab selection
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).

Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>

OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd
Closes #5404
2017-01-12 11:52:56 -08:00
George Melikov
e9aa730c49 OpenZFS 6328 - Fix cstyle errors in zfs codebase
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6328
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb
Closes #5579
2017-01-12 09:42:11 -08:00
ka7
4e33ba4c38 Fix spelling
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5547 
Closes #5543
2017-01-03 11:31:18 -06:00
Tim Chase
a5e046eaac 4.10 compat - BIO flag changes and others
[bio] The req_op enum was changed to req_opf.  Update the "Linux 4.8 API"
autotools checks to use an int to determine whether the various REQ_OP
values are defined.  This should work properly on kernels >= 4.8.

[bio] bio_set_op_attrs() is now an inline function and can't be detected
with #ifdef.  Add a configure check to determine whether bio_set_op_attrs()
is defined.  Move the local definition of it from vdev_disk.c to
blkdev_compat.h for consistency with other related compability shims.

[bio] The read/write flags and their modifiers, including WRITE_FLUSH,
WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h.  Add the new
bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA
and set the flags appropriately for each supported kernel version.

[vfs] The generic_readlink() function has been made static.  If .readlink
in inode_operations is NULL, generic_readlink() is used.

[zol typo] Completely unrelated to 4.10 compat, fix a typo in the check
for REQ_OP_SECURE_ERASE so that the proper macro is defined:

    s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5499
2016-12-30 16:03:59 -06:00
LOLi
3500a14595 Don't persist temporary pool name on devices
Fix a regression accidentally introduced by e0ab3ab.

Additionally, add a new script zpool_import_014_pos.ksh to
the ZFS test suite to exercise 'zpool import -t' functionality.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5466 
Closes #5515
2016-12-22 10:39:00 -08:00
Chunwei Chen
da8f51e16a Use a dedicated taskq for vdev_file
The introduction of parallel zvol prefetch causes deadlock when using
vdev_file.

spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync)
txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq)
zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock)

We fix this by using dedicated taskq for vdev_file.  This same change
was originally made in commit bc25c93 but reverted in commit aa9af22
when dynamic taskqs were added.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #5506 
Closes #5495
2016-12-21 10:47:15 -08:00
LOLi
5f1346c299 Fix dsl_props_set_sync_impl to work with nested nvlist
When iterating over the input nvlist in dsl_props_set_sync_impl() when we don't
preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to
process it nvpair_name() is always "value" and not the actual property name.

This fixes a couple of bugs in zfs_ioc_recv():
* Received properties were not restored correctly when failing to receive an
incremental send stream
* Received properties were not completely replaced by the new ones when
successfully receiving an incremental send stream

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5497
2016-12-20 18:46:59 -08:00
Brian Behlendorf
a3823f428d Fix file attributes
This branch contains the following fixes/improvements.

* Fix setting i_flags
* Fix wrong operator in xvattr.h
* Fix fchange macro in zpl_ioctl_setflags()
* Added configure check to use inode_set_flags()
* Added a test case for chattr for better test coverage

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5486 
Closes #5470 
Closes #5469
2016-12-19 13:01:10 -08:00
Clemens Fruhwirth
8e99d66b05 Add support for rw semaphore under PREEMPT_RT_FULL
The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread.  All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.

The implementation can be best understood by inspecting the RT
patch.  rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.

Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support.  Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.

Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes zfsonlinux/zfs#5491
Closes #589
Closes #308
2016-12-19 12:45:24 -08:00
Chunwei Chen
6c01a4af2b Fix zmo leak when zfs_sb_create fails
zfs_sb_create would normally takes ownership of zmo, and it will be freed in
zfs_sb_free. However, when zfs_sb_create fails we need to explicit free it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5490 
Closes #5496
2016-12-19 09:46:29 -08:00
Chunwei Chen
a5248129b8 Use inode_set_flags when available
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-16 13:54:51 -08:00
Chunwei Chen
c360af5411 Fix fchange in zpl_ioctl_setflags
The fchange in zpl_ioctl_setflags was for detecting flag change. However it
was incorrect and would always fail to detect a flag change from set to unset,
causing users without CAP_LINUX_IMMUTABLE to be able to unset flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-16 12:46:46 -08:00
Gvozden Neskovic
0101796290 ABD: Adapt avx512bw raidz assembly
Adapt avx512bw implementation for use with abd buffers. Mul2 implementation
is rewritten to take advantage of the BW instruction set.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Romain Dolbeau <romain.dolbeau@atos.net>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5477
2016-12-15 17:31:33 -08:00
Chunwei Chen
9c9ad845ef Refactor some splat macro to function
Refactor the code by making splat_test_{init,fini}, splat_subsystem_{init,fini}
into functions. They don't have reason to be macro and it would be too bloated
to inline every call.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-15 11:30:11 -08:00
Chunwei Chen
71a3c9c45d Fix splat memleak
SPLAT_TEST_FINI didn't call kfree causing memleak.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-15 11:30:11 -08:00
Chunwei Chen
7bb1325f95 Fix i_flags issue caused by 64c688d
Fix zfs_xvattr_set to set S_IMMUTABLE and S_APPEND flags correctly.

Reinstate zfs_set_inode_flags and use it when zfs_xvatter_set and also when
setting up inode in zfs_znode_alloc and zfs_rezget.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-14 14:48:09 -08:00
Chunwei Chen
f2d8bdc62e Add ida_destroy in zvol_fini to fix memleak
User of ida needs to call ida_destroy after using it. Otherwise
ida->free_bitmap and/or other stuff may leak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5484
2016-12-14 09:41:39 -08:00
bunder2015
f974e25dc1 Fix typos in dbuf.c
This removes two large whitespaces in "modinfo zfs" as well as correcting
a couple typos.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #5475
2016-12-13 14:21:02 -08:00
Brian Behlendorf
02730c333c Use cstyle -cpP in make cstyle check
Enable picky cstyle checks and resolve the new warnings.  The vast
majority of the changes needed were to handle minor issues with
whitespace formatting.  This patch contains no functional changes.

Non-whitespace changes are as follows:

* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5465
2016-12-12 10:46:26 -08:00
Chunwei Chen
a806cb6a89 Don't count '@' for dataset namelen if not a snapshot
Don't count '@' for dataset namelen if not a snapshot.  This
fixes making a pool unimportable when the  dataset namelen
is 255.

Add test file for zfs create name length 255.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5432 
Closes #5456
2016-12-09 11:52:08 -07:00
luozhengzheng
c077090a9b Fix coverity defects: CID 154617
CID 154617: Memory - illegal accesses (UNINIT)

The value here just needs to be initialized to make Coverity happy.
When dsize == 0, then value of daiter.iter_mapaddr is irrelevant. That
address won't be accessed, it's only used for some arithmetic. dsize
can be zero either if dabd is null, or if code column is longer than the
current data column.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5437
2016-12-08 14:48:09 -07:00
Brian Behlendorf
f95e647891 Speed up zvol import and export speed
Speed up import and export speed by:

* Add system delay taskq
* Parallel prefetch zvol dnodes during zvol_create_minors
* Parallel zvol_free during zvol_remove_minors
* Reduce list linear search using ida and hash

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5433
2016-12-08 14:05:02 -07:00
Chunwei Chen
f200b83673 Add system_delay_taskq for long delay
Add a dedicated system_delay_taskq for long delay like spa_deadman and
zpl_posix_acl_free. This will allow us to use system_taskq in the manner of
dispatch multiple tasks and call taskq_wait_outstanding.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #588
2016-12-08 14:00:20 -07:00
Brian Behlendorf
27f2b90d3e Revert "Disable zio_dva_throttle_enabled by default"
Enable zio_dva_throttle_enabled=1 by default. Subsequent
testing has been unable to reproduce the suspected regression.

Tested-by: kernelOfTruth kerneloftruth@gmail.com
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov
Reverts #5335
Closes #5289
Closes #5457
2016-12-08 13:57:42 -07:00
Gvozden Neskovic
e8a2014436 Cache ddt_get_dedup_dspace() value if there was no ddt changes
Save and reuse ddt dspace calculation when there have been no ddt changes.
This avoids unnecessary traversal of 168KiB of ddt histograms.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5425
2016-12-02 16:59:35 -07:00
Brian Behlendorf
baf67d15a5 Refactor txg history kstat
It was observed that even when the txg history is disabled by
setting `zfs_txg_history=0` the txg_sync thread still fetches
the vdev stats unnecessarily.

This patch refactors the code such that vdev_get_stats() is no
longer called when `zfs_txg_history=0`.  And it further reduces
the  differences between upstream and the ZoL txg_sync_thread()
function.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5412
2016-12-02 16:57:49 -07:00
Chunwei Chen
899662e344 zvol_remove_minors do parallel zvol_free
On some kernel version, blk_cleanup_queue and put_disk will wait for more then
10ms. So a pool with a lot of zvols will easily wait for more then 1 min if we
do zvol_free sequentially.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Requires-spl: refs/pull/588/head
2016-12-02 11:13:35 -08:00
Chunwei Chen
7ac557cef1 zpool_create_minors parallel prefetch
Do parallel prefetch all zvol dnodes before actually creating each individual.
This will greatly reduce the import time when having a lot of zvols and disk
is slow.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-02 11:13:35 -08:00
Brian Behlendorf
b0319c1faa OpenZFS 7143 - dbuf_read() creates unnecessary zio_root() for bonus buf
dbuf_read() creates a zio_root() to track and wait for all the
zio's that may happen as part of this call. However, if the blkptr_t
for this buffer is NULL or a hole, we will not create any more zio's,
so this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.

The fix is to only create/destroy the zio_root() in dbuf_read() if
the blkptr is not NULL and not a hole.

Changes sponsored by Intel Corp.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by:  Alex Zhuravlev <alexey.zhuravlev@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs/openzfs#137
Closes #4803 
Closes #5382
2016-12-01 16:50:11 -07:00
luozhengzheng
ba712624d6 Fix incorrect operator in abd_alloc_sametype()
This should be & and not | so is_metadata is set correctly.

Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5438
2016-12-01 16:45:16 -07:00
cao
e2c7d3785a Remove unused sa_update_from_cb()
It looks like this was functionality which was added in the
original SA implementation and then never needed.  It can
be safely removed now and easily added back if we find a
use for it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5440
2016-12-01 16:39:06 -07:00
cao
6a8fd57fa7 Compile zio.h and zio_impl.h mutual include
zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the
header files to contain each other.  Get rid of the zio_impl.h include
in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5439
2016-12-01 16:36:25 -07:00
Chunwei Chen
d45e010dce zvol: reduce linear list search
Use kernel ida to generate minor number, and use hash table to find zvol with
name.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-01 14:53:11 -08:00
Chunwei Chen
57ddcda164 Use system_delay_taskq for long delay tasks
Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire.
This free system_taskq from the above long delay tasks, and allow us to do
taskq_wait_outstanding on system_taskq without being blocked forever, making
system_taskq more generic and useful.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-01 14:52:48 -08:00
Chunwei Chen
493492559e Limit number of tasks shown in taskq proc
To prevent holding tq_lock for too long.

Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq
would easily cause a lockup. While that bug has been fixed. It's probably
still a good idea to do this just in case task lists grow too large.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #586
2016-12-01 11:06:27 -07:00
Brian Behlendorf
a3fd9d9e15 Convert zio_buf_alloc() consumers
In multiple cases zio_buf_alloc() was used instead of kmem_alloc()
or vmem_alloc().  This was often done because the allocations
could be large and it was easy to use zfs_buf_alloc() for them.

But this isn't ideal for allocations which are small or short
lived.  In these cases it is better to use kmem_alloc() or
vmem_alloc().  If possible we want to avoid the case where
we have slabs allocated for kmem caches which are rarely used.

Note for small allocations vmem_alloc() will be internally
converted to kmem_alloc().  Therefore as long as large
allocations are infrequent and short lived the penalty for
using vmem_alloc() is small.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5409
2016-11-30 16:18:20 -07:00
Chunwei Chen
9829574834 ABD optimized page allocation code
* Convert ABD to use the Linux Kernel scatterlist implementation
  instead of the hand rolled one from illumos.

* Scatter ABDs are preferentially populated with higher order
  compound pages from a single zone.  Allocation size is
  progressively decreased until it can be satisfied without
  performing reclaim or compaction.

* An alternate page allocator is provided for kernels older
  than 3.6 and for CONFIG_HIGHMEM systems.  This allocator
  is designed as a fallback for maximum compatibility.

* Extended abdstats to provide visibility in the the allocator.

* Add cached value for PAGESIZE in userspace.

Contributions-by:
Chunwei Chen <david.chen@osnexus.com>
Gvozden Neskovic <neskovic@gmail.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Isaac Huang <he.huang@intel.com>
David Quigley <david.quigley@intel.com>
Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-29 14:34:33 -08:00
Chunwei Chen
4f60152910 ABD kmap to kmap_atomic
Convert usage of kmap to kmap_atomic while correctly saving off
irq state.
2016-11-29 14:34:33 -08:00
Romain Dolbeau
88cc2352ea ABD raidz NEON support
Port NEON implementation of RAID-Z functions to ABD.

Signed-off-by: Roomain Dolbeau <romain.dolbeau@atos.net>
2016-11-29 14:34:33 -08:00
Gvozden Neskovic
65d71d4212 ABD raidz avx512f support
Implement shift based multiplication for 512f. Higher IPC over lookup based
methods yields up to 40% better performance on the current hardware.

Results on Xeon Phi(TM) CPU 7210:
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         142232671       24411492        12948205        283053705       22348167        4215911         9171609         2265548         2378370         1648495
scalar           295711162       49851491        33253815        293198109       88179448        61866752        27941684        25764416        17384442        12138153
sse2             410055998       199642658       117973654       406240463       152688682       121092250       84968180        79291076        47473657        20779719
ssse3            411641595       199669571       117937647       406211024       137638508       117050346       81263322        76120405        46281559        32696722
avx2             616485806       311515332       188595628       605455115       260602390       230554476       148198817       138800254       92273356        62937819
avx512f          832191523       408509425       253599522       810094481       404325734       317590971       218235687       197204920       133101937       94001219
fastest          avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f         avx512f

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
Gvozden Neskovic
cbf484f8ad ABD Vectorized raidz
Enable vectorized raidz code on ABD buffers.  The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
Gvozden Neskovic
a206522c4f ABD changes for vectorized RAIDZ
* userspace: aligned buffers. Minimum of 32B alignment is
  needed for AVX2. Kernel buffers are aligned 512B or more.
* add abd_get_offset_size() interface
* abd_iter_map(): fix calculation of iter_mapsize
* add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
Isaac Huang
b0be93e81a ABD page support to vdev_disk.c
Signed-off-by: Isaac Huang <he.huang@intel.com>
2016-11-29 14:34:32 -08:00
David Quigley
a6255b7fce DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
DeHackEd
7ca25051b6 Kernel 4.9 compat: file_operations->aio_fsync removal
Linux kernel commit 723c038475b78 removed this field.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #5393
2016-11-15 09:20:46 -08:00
luozhengzheng
32dec7bd1a Fix coverity defects: CID 147503
CID 147503: Dereference after null check (FORWARD_NULL)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5326
2016-11-10 08:50:32 -08:00
cao
3bfd95d589 Fix coverity defects: CID 147540, 147542
CID 147540: unsigned_compare
- Cast nsec to a int32_t to properly detect the expected overflow.
CID 147542: unsigned_compare
- intval can never be less than ZIO_FAILURE_MODE_WAIT which is
  defined to be zero.  Remove this useless check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5379
2016-11-09 17:35:26 -08:00
jxiong
126ae9f4e9 Export symbol dmu_objset_userobjspace_upgradable
It's used by Lustre to determine if the objset can be upgraded.
The inline version doesn't work because dmu_objset_is_snapshot()
is not exported.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes #5385
2016-11-09 13:51:12 -08:00
tuxoko
0420c126ce Linux 3.14 compat: assign inode->set_acl
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Massimo Maggi <me@massimo-maggi.eu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5371 
Closes #5375
2016-11-09 10:37:17 -08:00
cao
a36cc8d242 Fix coverity defects: CID 147626, 147628
CID 147626: Type:Dereference before null check
CID 147628: Type:Dereference before null check

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5304
2016-11-08 14:28:17 -08:00
Don Brady
976246fadd Add illumos FMD ZFS logic to ZED -- phase 2
The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a corres-
ponding fault diagnosis when the SERD engine fires. All the
diagnosis state data is collected into cases, one case per
vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #5343
2016-11-07 15:01:38 -08:00
cao
f4bae2ed63 Fix coverity defects: CID 147575, 147577, 147578, 147579
CID 147575, Type:Unintentional integer overflow
CID 147577, Type:Unintentional integer overflow
CID 147578, Type:Unintentional integer overflow
CID 147579, Type:Unintentional integer overflow

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5365
2016-11-07 14:54:32 -08:00
Chunwei Chen
8e71ab99dc Batch free zpl_posix_acl_release
Currently every calls to zpl_posix_acl_release will schedule a delayed task,
and each delayed task will add a timer. This used to be fine except for
possibly bad performance impact.

However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In
this new implementation, the larger the delay, the less accuracy the timer is.
So when we have a flood of timer from zpl_posix_acl_release, they will expire
at the same time. Couple with the fact that task_expire will do linear search
with lock held. This causes an extreme amount of contention inside interrupt
and would actually lockup the system.

We fix this by doing batch free to prevent a flood of delayed task. Every call
to zpl_posix_acl_release will put the posix_acl to be freed on a lockless
list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free
every posix_acl that passed the grace period on the list. This way, we only
have one delayed task every second.

[1] https://lwn.net/Articles/646950/

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-07 11:04:44 -08:00
Brian Behlendorf
34328f3cf8 Allow 16M zio buffers in user space
Only restrict the maximum zio alloc size to 32-bit kernel space.
The same virtual address space limitations don't apply to user
space.  This resolves a memory allocation failure in raidz_test
where it expects to be able to exercises all valid zio sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-07 10:26:17 -08:00
Romain Dolbeau
7f3194932d Add superscalar fletcher4
This is the Fletcher4 algorithm implemented in pure C, but using
multiple counters using algorithms identical to those used for
SSE/NEON and AVX2.

This allows for faster execution on core with strong superscalar
capabilities but weak SIMD capabilities.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #5317
2016-11-04 10:53:03 -07:00
Chunwei Chen
ace1eae84c Add support for O_TMPFILE
Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on
supported filesystem. It's basically doing open(2) and unlink(2) atomically.

The filesystem support is added through i_op->tmpfile. We basically copy the
create operation except we get rid of the link and name related stuff and add
the new node to unlinked set.

We also add support for linkat(2) to link tmpfile. However, since all previous
file operation will skip ZIL, we force a txg_wait_synced to make sure we are
sync safe.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-04 10:46:40 -07:00
Chunwei Chen
987014903f Fix unlinked file cannot do xattr operations
Currently, doing things like fsetxattr(2) on an unlinked file will result in
ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget.

The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we
need it to not return error when the zp is unlinked. This is a pretty big
change in behavior, but skimming through all the callers, I don't think this
change would cause any problem. Also there's nothing preventing z_unlinked
from being set after the z_lock mutex is dropped before but before zfs_zget
returns anyway.

The rest of the stuff is to make sure we don't log xattr stuff when owner is
unlinked.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-04 10:46:40 -07:00
Romain Dolbeau
7f547f85fe Add parity generation/rebuild using AVX-512 for x86-64
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.

avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes #5219
2016-11-02 12:40:23 -07:00
BearBabyLiu
6d4210052b Fix dsl_prop_get_all_dsl() memory leak
On error dsl_prop_get_all_ds() does not free the nvlist it allocates.
This behavior may have been intentional when originally written
but is atypical and often confusing.  Since no callers rely on this
behavior the function has been updated to always free the nvlist
on error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn>
Closes #5320
2016-11-02 12:34:10 -07:00
Brian Behlendorf
9edb36954a Use vmem_size() for 32-bit systems
On 32-bit Linux systems use vmem_size() to correctly size the ARC
and better determine when IO should be throttle due to low memory.

On 64-bit systems this change has no effect since the virtual
address space available far exceeds the physical memory available.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
2016-11-02 12:14:45 -07:00
Brian Behlendorf
82ec9d41d8 Fix 32-bit maximum volume size
A limit of 1TB exists for zvols on 32-bit systems.  Update the code
to correctly reflect this limitation in a similar manor as the
OpenZFS implementation.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
2016-11-02 12:14:45 -07:00
Brian Behlendorf
4990e576c6 Enable .zfs/snapshot for 32-bit systems
Originally the .zfs/snapshot directory was disabled for 32-bit systems
because 64-bit inode numbers were not supported.  This is no longer
the case and this functionality can be enabled by default.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
Closes #2002
2016-11-02 12:14:45 -07:00
Brian Behlendorf
48d3eb40c7 Add TASKQID_INVALID
Add the TASKQID_INVALID macros and update callers to use the macro
instead of testing against 0.  There is no functional change
even though the functions in zfs_ctldir.c incorrectly used -1
instead of 0.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
2016-11-02 12:14:45 -07:00
Ubuntu
cbba714667 Add TASKQID_INVALID and TASKQID_INITIAL macros
Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the
taskq implementation and test cases to use them.  This is solely
for the purposes of readability and introduces no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-02 10:34:19 -07:00
Ubuntu
1b457bcbe5 Fix vmem_size()
Add a minimal implementation of vmem_size() which accounts for the
virtual memory usage of the SPL's kmem cache.  This functionality
is only useful on 32-bit systems with a small virtual address space.

The following assumptions are made:

  1) The major SPL consumer of virtual memory is the kmem cache.
  2) Memory allocated with vmem_alloc() is short lived and can be ignored.
  3) Allow a 4MB floor as a generous pad given normal consumption.
  4) The spl_kmem_cache_sem only contends with cache create/destroy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-02 10:34:19 -07:00
cao
51c9163f98 Fix sa_legacy_attr_count to use ARRAY_SIZE
Replace magic value 16 with ARRAY_SIZE() to correctly handle
when the sa_legacy_attrs array size changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5354
2016-11-02 10:26:12 -07:00
cao
981b21260e Fix coverity defects: CID 147553
CID 147553: Type:Dereference null return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5305
2016-11-01 10:20:24 -07:00
cao
b182ac00aa Fix coverity defects: CID 152975
CID 152975: Type:Dereference null return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5322
2016-10-31 16:23:56 -07:00
GeLiXin
4aafab91c5 Fix coverity defects: CID 147509
CID 147509: Explicit null dereferenced
- l2arc_sublist_lock is fragile as relied on caller too much.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5319
2016-10-31 16:04:01 -07:00
Hajo Möller
e02aaf17f1 Fix lookup_bdev() on Ubuntu
Ubuntu added support for checking inode permissions to lookup_bdev() in kernel
commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21).
Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517

This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and
calls the function in the correct way.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Closes #5336
2016-10-26 10:30:43 -07:00
Brian Behlendorf
76a87a902e Disable zio_dva_throttle_enabled by default
Until it can be determined definitively that a performance
regression wasn't introduced accidentally by 3dfb57a this
functionality is being disabled by default.  It can be re-
enabled by setting zio_dva_throttle_enabled=1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5335 
Issue #5289
2016-10-26 09:13:43 -07:00
Tony Hutter
6568379eea Fix statechange-led.sh & unnecessary libdevmapper warning
- Fix autoreplace behaviour on statechange-led.sh script.

ZED sends the following events on an auto-replace:

1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE

Events 1-2 happen when ZED first attempts to do an auto-online.  When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.

In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED.  It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline.  This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.

This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL.  It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.

- Remove unnecessary libdevmapper warning with --with-config=kernel

This fixes an unnecessary libdevmapper warning when building
--with-config=kernel.  Kernel code does not use libdevmapper, so the warning
is not needed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2375 
Closes #5312 
Closes #5331
2016-10-25 11:05:30 -07:00
Jason Zaman
402c7c27b0 icp: mark asm files with noexec stack
Similar to commit a3600a106.  Asm files need an explicit note
that they do not require an executable stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Closes #5332
2016-10-25 10:44:09 -07:00
tuxoko
9fa4db44b7 Fix cred leak in zpl_fallocate_common
This is caught by kmemleak when running compress_004_pos

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5244 
Closes #5330
2016-10-24 16:41:56 -07:00
Brian Behlendorf
13d9a004fe Fix taskq creation failure in vdev_open_children()
When creating and destroying pools in tight loop it's possible to
exhaust the number of allowed threads on a system.  This results
in taskq_create() failling and a NULL dereference.

Resolve the issue by falling back to opening the vdevs all
synchronously.

Reviewed-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#521
Closes #4637
2016-10-24 13:28:58 -07:00
Tony Hutter
1bbd877049 Turn on/off enclosure slot fault LED even when disk isn't present
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on.  During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.

To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import.  It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly.  This allows the script to turn on/off the slot
LEDs even when the drive is missing.

Closes #5309 
Closes #2375
2016-10-24 10:45:59 -07:00
Romain Dolbeau
24cdeaf12e Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits
This is not useful on micro-architecture with a weak NEON
implementation (only 64 bits); the native version is slower &
the byteswap barely faster than scalar.  On A53 or A57, it's
a small improvement on scalar but OK for byteswap.

Results from an A53 system:
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation   native         byteswap       
scalar           1008227510     755880264      
aarch64_neon     1198098720     1044818671     
fastest          aarch64_neon   aarch64_neon 

Results from a A57 system:
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon 

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #5248
2016-10-21 10:55:49 -07:00
Brian Behlendorf
e4ffa98dca Fix userquota_compare() function
The AVL tree compare function requires that either -1, 0, or 1 be
returned.  However the strcmp() function only guarantees that a
negative, zero, or positive value is returned.  Therefore, the
return value of strcmp() needs to be sanitized with AVL_ISIGN.

This was initially overlooked because the x86_64 implementation
of strcmp() happens to only returns the allowed values.  This
was observed on an aarch64 platform which behaves correctly but
differently as described above.

Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5311 
Closes #5313
2016-10-21 08:23:27 -07:00
luozhengzheng
9523b15ac1 Fix coverity defects: CID 153459
CID 153459: Null pointer dereferences (FORWARD_NULL)
Accidentally introduced by #5159.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5310
2016-10-20 11:54:02 -07:00
cao
9d01680430 Fix coverity defects: CID 147551, 147552
CID 147551: Type:dereference null return value
CID 147552: Type:dereference null return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5279
2016-10-20 11:49:50 -07:00
cao
5a6765cf8c Fix coverity defects: CID 147472
CID 147472: Type: 'Constant' variable guards dead code

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5288
2016-10-20 11:24:01 -07:00
luozhengzheng
1f72394443 Fix coverity defects: CID 150919, 150923
CID 150919: Buffer not null terminated (BUFFER_SIZE_WARNING)
CID 150923: Buffer not null terminated (BUFFER_SIZE_WARNING)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5298
2016-10-20 11:09:39 -07:00
Brian Behlendorf
3b0ba3ba99 Linux 4.9 compat: inode_change_ok() renamed setattr_prepare()
In torvalds/linux@31051c8 the inode_change_ok() function was
renamed setattr_prepare() and updated to take a dentry ratheri
than an inode.  Update the code to call the setattr_prepare()
and add a wrapper function which call inode_change_ok() for
older kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Requires-spl: refs/pull/581/head
2016-10-20 09:39:09 -07:00
Chunwei Chen
0fedeedd30 Linux 4.9 compat: remove iops->{set,get,remove}xattr
In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and
generic_{set,get,remove}xattr are removed. xattr operations will directly
go through sb->s_xattr.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-10-20 09:39:09 -07:00
Chunwei Chen
b8d9e26440 Linux 4.9 compat: iops->rename() wants flags
In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are
merged together into iops->rename(), it now wants flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-10-20 09:39:09 -07:00
Chunwei Chen
8ba3f2bf6a Remove dir inode operations from zpl_inode_operations
These operations are dir specific, there's no point putting them in
zpl_inode_operations which is for regular files.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-10-20 09:39:09 -07:00
Chunwei Chen
ae7eda1dde Linux 4.9 compat: group_info changes
In Linux 4.9, torvalds/linux@81243ea, group_info changed from 2d array via
->blocks to 1d array via ->gid. We change the spl cred functions accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #581
2016-10-20 09:33:28 -07:00
Chunwei Chen
87063d7dc3 Fix splat-cred.c cred usage
No need to crhold current_cred(), fix possible leak in splat_cred_test2

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556
2016-10-20 09:33:22 -07:00
Chunwei Chen
9ba3c01923 Fix crgetgroups out-of-bound and misc cred fix
init_groups has 0 nblocks, therefore calling the current crgetgroups with
init_groups would result in out-of-bound access. We fix this by returning NULL
when nblocks is 0.

Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return
blocks[0].

Also, remove all get_group_info. The cred already holds reference on the
group_info, and cred is not mutable. So there's no reason to hold extra
reference, if we hold cred.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556
2016-10-20 09:33:01 -07:00
Tony Hutter
6078881aa1 Multipath autoreplace, control enclosure LEDs, event rate limiting
1. Enable multipath autoreplace support for FMA.

This extends FMA autoreplace to work with multipath disks.  This
requires libdevmapper to be installed at build time.

2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online

Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED.  Your enclosure must
be supported by the Linux SES driver for this to work.  The enclosure LED
scripts work for multipath devices as well.  The scripts will clear the LED
when the fault is cleared.

3. Rate limit ZIO delay and checksum events so as not to flood ZED

ZIO delay and checksum events are rate limited to 5/sec in the zfs module.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2449 
Closes #3017 
Closes #5159
2016-10-19 12:55:59 -07:00
luozhengzheng
7c502b0b1d Fix coverity defects: CID 150926
CID 150926: Unchecked return value (CHECKED_RETURN)
- This case cannot occur given the existing taskq implementation
  and flags passed to task_dispatch().

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5272
2016-10-18 11:32:59 -07:00
Brian Behlendorf
6d00b5e136 Fix unused variable
Accidentally introduced by 3dfb57a, when building with debugging
disabled several variables are unused.  Resolve this by wrapping
them in ASSERTV to remove them for non-debug builds.

Reviewed by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5284
2016-10-18 10:44:44 -07:00
cao
1b81ab46d0 Fix coverity defects: CID 49339, 153393
CID 49339: Type:Buffer not null terminated
CID 153393: Type:Buffer not null terminated

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: <cao.xuewen cao.xuewen@zte.com.cn>
Closes #5296
2016-10-18 10:31:57 -07:00
luozhengzheng
b60eac3d1a Fix coverity defects: CID 150924
CID 150924: Unchecked return value (CHECKED_RETURN)
- On taskq_dispatch failure the reference must be dropped and
  this entry can be safely skipped.  This case should be impossible
  in the existing implementation but should be handled regardless.
  
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5278
2016-10-17 12:03:52 -07:00
cao
b6ca6193f7 Fix coverity defects: CID 147488, 147490
CID 147488, Type:explicit null dereferenced
CID 147490, Type:dereference null return value

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5237
2016-10-14 11:00:47 -07:00
Don Brady
3dfb57a35e OpenZFS 7090 - zfs should throttle allocations
OpenZFS 7090 - zfs should throttle allocations

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>

When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.

In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7
Closes #5258 

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
2016-10-13 17:59:18 -07:00
cao
3f93077b02 Fix coverity defects: CID 150943, 150938
CID:150943, Type:Unintentional integer overflow
CID:150938, Type:Explicit null dereferenced

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5255
2016-10-13 14:30:50 -07:00
luozhengzheng
05852b3467 Fix coverity defects: CID 147571, 147574
CID 147571: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)
CID 147574: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5268
2016-10-13 14:25:05 -07:00
luozhengzheng
1f51b525ff Fix coverity defects: CID 153394
coverity scan CID 153394, Type:String overflow

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5263
2016-10-12 13:24:03 -07:00
Tom Caputi
ef78750d98 Fix ICP memleak introduced in #4760
The ICP requires destructors to for each crypto module that is added.
These do not necessarily exist in Illumos because they assume that
these modules can never be unloaded from the kernel. Some of this
cleanup code was missed when #4760 was merged, resulting in leaks.
This patch simply fixes that.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #4760 
Closes #5265
2016-10-12 12:52:30 -07:00
Brian Behlendorf
1697d2dcf1 Fix zfsctl_snapshot_{,un}mount() issues
Fix use after free in zfsctl_snapshot_unmount(). Use /usr/bin/env
instead of /bin/sh to fix a shell code injection flaw and allow use
with grsecurity.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Stian Ellingsen <stian@plaimi.net>  
Closes #5250 
Closes #4377
2016-10-11 09:56:28 -07:00
Tim Chase
d33931a83a Write issue taskq shouldn't be dynamic
This is as much an upstream compatibility as it's a bit of a performance
gain.

The illumos taskq implemention doesn't allow a TASKQ_THREADS_CPU_PCT type
to be dynamic and in fact enforces as much with an ASSERT.

As to performance, if this taskq is dynamic, it can cause excessive
contention on tq_lock as the threads are created and destroyed because it
can see bursts of many thousands of tasks in a short time, particularly
in heavy high-concurrency zvol write workloads.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5236
2016-10-10 15:19:14 -07:00
Tom Caputi
57f16600b9 Porting over some ICP code that was missed in #4760
When #4760 was merged tests were added to ensure that the new checksums
were working properly. However, some of the functionality for sha2
functions were not ported over, resulting in some Coverity defects and
code that would be unstable when needed in the future. This patch
simply ports over the missing code and fixes the defects in the
process.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #4760 
Closes #5251
2016-10-10 11:34:57 -07:00
Brian Behlendorf
7515f8f63d Fix file permissions
The following new test cases need to have execute permissions set:

  userquota/groupspace_003_pos.ksh
  userquota/userquota_013_pos.ksh
  userquota/userspace_003_pos.ksh
  upgrade/upgrade_userobj_001_pos.ksh
  upgrade/setup.ksh
  upgrade/cleanup.ksh

The following source files accidentally were marked executable:

  lib/libzpool/kernel.c
  lib/libshare/nfs.c
  lib/libzfs/libzfs_dataset.c
  lib/libzfs/libzfs_util.c
  tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
  tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
  cmd/zed/zed_exec.c
  module/icp/core/kcf_sched.c
  module/zfs/dsl_pool.c
  module/zfs/arc.c
  module/nvpair/nvpair.c
  man/man5/zfs-module-parameters.5

Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5241
2016-10-08 14:57:56 -07:00
Stian Ellingsen
5dc1ff29ec
Use env, not sh in zfsctl_snapshot_{,un}mount()
Call mount and umount via /usr/bin/env instead of /bin/sh in
zfsctl_snapshot_mount() and zfsctl_snapshot_unmount().

This change fixes a shell code injection flaw.  The call to /bin/sh
passed the mountpoint unescaped, only surrounded by single quotes.  A
mountpoint containing one or more single quotes would cause the command
to fail or potentially execute arbitrary shell code.

This change also provides compatibility with grsecurity patches.
Grsecurity only allows call_usermodehelper() to use helper binaries in
certain paths.  /usr/bin/* is allowed, /bin/* is not.
2016-10-08 17:43:29 +02:00
Stian Ellingsen
00b65db711
Fix use after free in zfsctl_snapshot_unmount() 2016-10-08 17:42:52 +02:00
Brian Behlendorf
690fe6479e Rename hole_birth tunable to match OpenZFS
OpenZFS decided that ignore_hole_birth was too imprecise and
incorrect a name (and went with send_holes_without_birth_time).
Rename it in ZoL too, while keeping the name "ignore_hole_birth"
pointing to the same variable for existing consumers.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #5239
2016-10-07 21:02:24 -07:00
tuxoko
0d26756665 Fix out-of-bound in per_cpu in spl_random_init
When iterating per_cpu values, we need to use for_each_possible_cpu. While
NR_CPUS indicates the number of CPU supported by the kernel, it might not
initialize all of them if the kernel decides it's not possible to use them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #578
2016-10-07 20:59:46 -07:00
Håkan Johansson
4770aa0643 Fix vdev_open_child() race on updating vdev_parent->vdev_nonrot
Updating vd->vdev_parent->vdev_nonrot in vdev_open_child()
is a race when vdev_open_child is called for many children
from a task queue.

vdev_open_child() is only called by vdev_open_children(), let
the latter update the parent vdev_nonrot member.  The update
was already there, so done twice previously.  Thus using the
same logic at the end in vdev_open_children() to update
vdev_nonrot, either we are vdev_uses_zvols() or not.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5162
2016-10-07 13:25:35 -07:00
cao
ccc92611b1 Fix coverity defects: CID 147565-147567
coverity scan CID:147567, Type:dereference null return value
coverity scan CID:147566, Type:dereference null return value
coverity scan CID:147565, Type:dereference null return value

Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5166
2016-10-07 13:19:43 -07:00
Brian Behlendorf
482cd9ee69 Fletcher4: Incremental updates and ctx calculation
Fixes ABI issues with fletcher4 code, adds support for
incremental updates, and adds ztest method for testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5164
2016-10-07 12:44:12 -07:00
Jinshan Xiong
9b7a83cbb6 OpenZFS 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates
Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Upstream bugs: DLPX-44799
Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6988
ZFSonLinux-issue: https://github.com/zfsonlinux/zfs/issues/4642
OpenZFS-commit: unmerged

Porting notes:
- Added curly braces around declaration of userquota_cache_t cache to
  quiet compiler warning;
- Handled the userobj accounting the same way it proposed in this path.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
2016-10-07 09:45:13 -07:00
Jinshan Xiong
1de321e626 Add support for user/group dnode accounting & quota
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
2016-10-07 09:45:13 -07:00
lorddoskias
64c688d716 Refactor updating of immutable/appendonly flags
Move the synchronization of inode/znode i_flgas/pflags into
the respective internal zfs function. This is mostly
mechanical work and shouldn't introduce any functional
changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Issue #227 
Closes #5223
2016-10-05 14:47:29 -07:00
Gvozden Neskovic
5bf703b8f3 Fletcher4: save/reload implementation context
Init, compute, and fini methods are changed to work on internal context object.
This is necessary because ABI does not guarantee that SIMD registers will be preserved
on function calls. This is technically the case in Linux kernel in between
`kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that
don't require disabling preemption for using SIMD (osx).

Use scalar compute methods in-place for small buffers, and when the buffer size
does not meet SIMD size alignment.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-10-05 16:41:46 +02:00
Gvozden Neskovic
37f520db2d Fletcher4: Incremental using SIMD
Combine incrementally computed fletcher4 checksums. Checksums are combined
a posteriori, allowing for parallel computation on chunks to be implemented if
required. The algorithm is general, and does not add changes in each SIMD
implementation.
New test in ztest verifies incremental fletcher computations.

Checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are
respective fletcher4 checksums, `Cab` is combined checksum, `s` is size of buffer
`b` (divided by sizeof(uint32_t)) is:

Cab[A] = Cb[A] + Ca[A]
Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]

NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation
is performed on 8MiB chunks.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-10-05 16:41:46 +02:00
luozhengzheng
e2c292bbfc Fix coverity defects: CID 150953, 147603, 147610
coverity scan CID:150953,type: uninitialized scalar variable
coverity scan CID:147603,type: Resource leak
coverity scan CID:147610,type: Resource leak

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5209
2016-10-04 18:15:57 -07:00
Brian Behlendorf
341dfdb3fd Fix p0 initializer
Due to changes in the task_struct the following warning is occurs
when initializing the global p0.  Since this structure only exists
for it's address to be taken initialize it in a manor which isn't
sensitive to internal changes to the structure.

  module/spl/spl-generic.c:58:1: error: missing braces around
  initializer [-Werror=missing-braces]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #576
2016-10-04 17:26:36 -07:00
ilovezfs
125a406e24 OpenZFS 6585 - sha512, skein, and edonr have an unenforced dependency on extensible dataset
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>

In any pool without the extensible dataset feature flag already enabled,
creating a dataset with dedup set to use one of the new checksums would
result in the following panic as soon as any data was added:

panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
&refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c
line 390

Inpsection showed that feature->fi_feature was 7, which is the value of
SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum.  This commit
adds extensible dataset as a dependency for the sha512, edonr, and skein
feature flags, which prevents the panic.

OpenZFS-issue: https://www.illumos.org/issues/6585
OpenZFS-commit: 892586e8a1
Porting Notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@b62a652
2016-10-03 14:51:21 -07:00
ilovezfs
4a2e9a17d5 OpenZFS 6541 - Pool feature-flag check defeated if "verify" is included in the dedup property value
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Tony Hutter <hutter2@llnl.gov>

zio_checksum_to_feature() expects a zio_checksum enum not a raw property
intval, so the new checksums weren't being detected when the
ZIO_CHECKSUM_VERIFY flag got in the way.

Given a pool without feature@sha512,

    zfs create -o dedup=sha512 naughty/fivetwelve_noverify_ds

would fail as expected since the raw intval would indeed be equal to
SPA_FEATURE_SHA512.

However,

    zfs create -o dedup=sha512,verify naughty/fivetwelve_verify_ds

would incorrectly succeed because ZIO_CHECKSUM_VERIFY would be in the
way, the raw intval would not be a member of the enum, and
zio_checksum_to_feature() would return SPA_FEATURE_NONE, with the result
that spa_feature_is_enabled() would never be called.

This was first detected with edonr, since in that case verify is
required.

This commit clears the ZIO_CHECKSUM_VERIFY flag before calling
zio_checksum_to_feature() using the ZIO_CHECKSUM_MASK and verifies in
zio_checksum_to_feature() that ZIO_CHECKSUM_MASK has been applied by the
caller to attempt to prevent the same bug from occurring again in the
future.

OpenZFS-issue: https://www.illumos.org/issues/6541
OpenZFS-commit: 971640e6aa

Porting notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@bef06e1
2016-10-03 14:51:21 -07:00
Tony Hutter
3c67d83a8a OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:

    b5e030c8db

The list of porting changes includes:

- Copied module/icp/include/sha2/sha2.h directly from illumos

- Removed from module/icp/algs/sha2/sha2.c:
	#pragma inline(SHA256Init, SHA384Init, SHA512Init)

- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
  it now takes in an extra parameter.

- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c

- Added skein & edonr to libicp/Makefile.am

- Added sha512.S.  It was generated from sha512-x86_64.pl in Illumos.

- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.

- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
  to not #include the non-existant endian.h.

- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
  around a compiler warning.

- Fixup test files:
	- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
	- Remove <note.h> and define NOTE() as NOP.
	- Define u_longlong_t
	- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
	- Rename NULL to 0 in "no test vector" array entries to get around a
	  compiler warning.
	- Remove "for isa in $($ISAINFO); do" stuff
	- Add/update Makefiles
	- Add some userspace headers like stdio.h/stdlib.h in places of
	  sys/types.h.

- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.

- Update scripts/zfs2zol-patch.sed

- include <sys/sha2.h> in sha2_impl.h

- Add sha2.h to include/sys/Makefile.am

- Add skein and edonr dirs to icp Makefile

- Add new checksums to zpool_get.cfg

- Move checksum switch block from zfs_secpolicy_setprop() to
  zfs_check_settable()

- Fix -Wuninitialized error in edonr_byteorder.h on PPC

- Fix stack frame size errors on ARM32
  	- Don't unroll loops in Skein on 32-bit to save stack space
  	- Add memory barriers in sha2.c on 32-bit to save stack space

- Add filetest_001_pos.ksh checksum sanity test

- Add option to write psudorandom data in file_write utility
2016-10-03 14:51:15 -07:00
Romain Dolbeau
62a65a654e Add parity generation/rebuild using 128-bits NEON for Aarch64
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #4801
2016-10-03 09:44:00 -07:00
luozhengzheng
aecdc70604 Fix coverity defects: CID 147448, 147449, 147450, 147453, 147454
coverity scan CID:147448,type: unchecked return value
coverity scan CID:147449,type: unchecked return value
coverity scan CID:147450,type: unchecked return value
coverity scan CID:147453,type: unchecked return value
coverity scan CID:147454,type: unchecked return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5206
2016-10-02 11:24:54 -07:00
Brian Behlendorf
6c2a66bfa8 Fix aarch64 type warning
Explicitly cast type in splat-rwlock.c test case to silence
the following warning.

  warning: format ‘%ld’ expects argument of type ‘long int’,
  but argument N has type ‘int’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574
2016-10-01 18:33:01 -07:00
candychencan
0ca5261be4 Fix NULL deref in kcf_remove_mech_provider
In the default case the function must return to avoid dereferencing
'prov_mech' which will be NULL.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: candychencan <chen.can2@zte.com.cn>
Closes #5134
2016-09-30 16:04:43 -07:00
cao
0a8f18f932 Fix coverity defects: CID 147563, 147560
coverity scan CID:147563, Type:dereference null return value
coverity scan CID:147560, Type:dereference null return value

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5168
2016-09-30 15:56:17 -07:00
GeLiXin
470f12d631 Fix coverity defects: CID 147531 147532 147533 147535
coverity scan CID:147531,type: Argument cannot be negative
- may copy data with negative size
coverity scan CID:147532,type: resource leaks
- may close a fd which is negative
coverity scan CID:147533,type: resource leaks
- may call pwrite64 with a negative size
coverity scan CID:147535,type: resource leaks
- may call fdopen with a negative fd

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes #5176
2016-09-30 15:47:57 -07:00
Brian Behlendorf
2db28197fe Fix cppcheck warning in buf_init()
Cppcheck 1.63 erroneously complains about an uninitialized value
in buf_init().  Newer versions of cppcheck (1.72) handle this
correctly but we'll initialize the value anyway to silence the
warning.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5203
2016-09-30 15:04:21 -07:00
Gvozden Neskovic
6ca636a152 Avoid undefined shift overflow in fzap_cursor_retrieve()
Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain
the same.

Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same
problem.

Issue #4883

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-09-29 15:55:41 -07:00
Gvozden Neskovic
4ca9c1de12 Explicit integer promotion for bit shift operations
Explicitly promote variables to correct type. Undefined behavior is
reported because length of int is not well defined by C standard.

Issue #4883

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-09-29 15:55:41 -07:00
Gvozden Neskovic
031d7c2fe6 fix: Shift exponent too large
Undefined operation is reported by running ztest (or zloop) compiled with GCC
UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection
with large enough offset values. Logically, left shift operation would work,
but bit shift semantics in C, and limitation of uint64_t, do not produce desired
result.

Issue #5059, #4883

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-09-29 15:55:41 -07:00
Isaac Huang
e8ac4557af Explicit block device plugging when submitting multiple BIOs
Without plugging, the default 'noop' scheduler will not merge
the BIOs which are part of a large ZIO.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5181
2016-09-29 13:13:31 -07:00
cao
c9d61adbf8 Fix coverity defects: 147658, 147652, 147651
coverity scan CID:147658, Type:copy into fixed size buffer.
coverity scan CID:147652, Type:copy into fixed size buffer.
coverity scan CID:147651, Type:copy into fixed size buffer.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5160
2016-09-29 12:06:14 -07:00
lorddoskias
12fa7f3436 Refactor inode->i_mode management
Refactor the code in such a way so that inode->i_mode is being set
at the same time zp->z_mode is being changed. This has the effect of
keeping both in sync without relying on zfs_inode_update.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes #5158
2016-09-27 14:08:52 -07:00
cao
680eada9b0 Fix coverity defects: CID 147650, 147649, 147647, 147646
coverity scan CID:147650, Type:copy into fixed size buffer.
coverity scan CID:147649, Type:copy into fixed size buffer.
coverity scan CID:147647, Type:copy into fixed size buffer.
coverity scan CID:147646, Type:copy into fixed size buffer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5161
2016-09-25 15:08:28 -07:00
Brian Behlendorf
7571033285 Fix multilist_create() memory leak
In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists
must be destroyed.  This accidentally regressed in d3c2ae1c.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5151 
Closes #5152
2016-09-23 10:55:10 -07:00
tuxoko
d5b897a6a1 Linux 4.7 compat: Fix deadlock during lookup on case-insensitive
We must not use d_add_ci if the dentry already has the real name. Otherwise,
d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait
on itself causing deadlock.

Tested-by: satmandu
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5124 
Closes #5141 
Closes #5147 
Closes #5148
2016-09-22 19:09:16 -07:00
kernelOfTruth aka. kOT, Gentoo user
51907a31bc OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records
Authored by: Matt Krantz <matt.krantz@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/7230
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2
Closes #5112
2016-09-22 16:01:19 -07:00
luozhengzheng
160987b576 Fix coverity defects
coverity scan CID:147633,type: sizeof not portable
coverity scan CID:147637,type: sizeof not portable
coverity scan CID:147638,type: sizeof not portable
coverity scan CID:147640,type: sizeof not portable

In these particular cases sizeof (XX **) happens to be equal to sizeof (X *),
but this is not a portable assumption.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5144
2016-09-21 18:09:00 -07:00
Isaac Huang
da8d57488b Reduce noise in tracing logs
dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds
lots of noise in tracing logs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #4430 
Closes #5146
2016-09-21 13:37:20 -07:00
BearBabyLiu
609603a5d3 Fix coverity defects
coverity scan CID:147504 Type: Explicit null dereferenced
Reason: passing null pointer dl to zfs_dirent_unlock

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn>
Closes #5131
2016-09-20 19:09:22 -07:00
Tim Chase
25e2ab16be Fix arc_adjust_meta_balanced()
The type of "adjustmnt" was erroneously changed to unsigned when the compressed
ARC code was ported in d3c2ae1c08.

As a result of it being unsigned, the balanced metadata eviction logic
would evict all of the non-metadata.

Reviewed-by: Chris Severance <github.severach@spamgourmet.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: David Quigley <david.quigley@intel.com>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes #5128 
Closes #5129
2016-09-19 09:28:35 -07:00
luozhengzheng
30f3f2e13c Fix Coverity defects
CID 147659, 150952 and 147645

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes #5103
2016-09-17 15:08:54 -07:00
Brian Behlendorf
cb81c0c588 Increase spl_kmem_alloc_warn limit
In order to support ABD with large blocks the spl_kmem_alloc_warn
limit needs to be increased to 64K.

A 16M block requires that pointers be stored for 4096 4K-pages
on an x86_64 system.  Each of these pointers is 8 bytes requiring
an allocation of 8*4096=32,768 bytes.  The addition of a small
header to this structure pushes the allocation over the default
32K warning threshold.

In addition, fix a small bug where MAX was used instead of MIN
when setting the default.  This ensures a reasonable limit is
still set on systems with page sizes larger then 4K.

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #571
2016-09-16 17:10:36 -07:00
Brian Behlendorf
9ea9e0b9a1 Enable ignore_hole_birth module option by default
Enable ignore_hole_birth by default until all known hole birth bugs
have been resolved and relevant test cases added.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4809
Closes #5099
2016-09-16 14:05:30 -07:00
Nikolay Borisov
87f9371aef Simplify time handling logic in zfs_settattr
Simplify time handling in zfs_setattr by mimicking the logic in
setattr_copy from the linux kernel. In order to achieve this
in the case when ZFS' log is being replayed it is necessary
to unconditionally set the ctime in zfs_replay_setattr.

Also use the timespec_trunc function when assigning values to the
generic inode struct. This is currently a noop since zfs sets
s_time_gran to 1, however in the future rules about precision might
change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes #4916
2016-09-13 12:00:18 -07:00
Nikolay Borisov
9f5f0019ab Refactor generic inode time updating
ZFS doesn't provide a custom update_time method meaning it delegates
this job to the generic VFS layer. The only time when it needs to
set the various *time values is when the inode is being marshalled
to/from the disk. Do this by moving the relevant code from
zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result
from this change it is no longer necessary to have multiple versions
of the zfs_inode_update function - so just nuke them and leave only
one.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Issue #227
Closes #4916
2016-09-13 11:57:37 -07:00
Dan Kimmel
524b4217b8 DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone()
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:59:13 -07:00
Tom Caputi
c17bcf83da Enable raw writes to perform dedup with verification
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: David Quigley <david.quigley@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #5078
2016-09-13 09:59:04 -07:00
Dan Kimmel
2aa34383b9 DLPX-40252 integrate EP-476 compressed zfs send/receive
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:58:58 -07:00
George Wilson
d3c2ae1c08 OpenZFS 6950 - ARC should cache compressed data
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.

I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
2016-09-13 09:58:33 -07:00
Tim Chase
43924bfeaa Remove redundant assignments to arc_c
Several assignments to arc_c had no effect because it is ultimately
initialized to arc_c_max.

This aligns ZoL better with the upstream code which removed these
assignments some time ago.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes #5081
2016-09-12 12:40:30 -07:00
Nikolay Borisov
67d6082494 Refactor spa_load_l2cache to make build happy
In case sav->sav_config was NULL the body of the function
would skip the iteration of the l2 cache devices and will
just cleanup the old devices. However, this wasn't very obvious
since the null check was performed after the loop body and after
the old devices were cleaned. Refactor the code so that it's now
obvious when the iteration of the l2cache devices is skipped.

This fixes the following cppcheck warning:

[module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes #5087
2016-09-12 12:40:03 -07:00
Tim Chase
20aa7a4e31 Free property names with spa_strfree() rather than strfree()
Since they're allocated with spa_strdup(), they should be freed with
spa_strfree() so the proper length buffer is freed.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5082
Closes #5086
2016-09-12 09:45:26 -07:00
Don Brady
d02ca37979 Bring over illumos ZFS FMA logic -- phase 1
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4673
2016-09-01 11:39:45 -07:00
luozhengzheng
0b284702b7 Delete unreferenced function zfs_ereport_send_interim_checksum
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5055
2016-09-01 11:39:45 -07:00
luozhengzheng
ca8587a517 kmem_zalloc with KM_SLEEP will never return NULL
These allocations can never fail.  Leaving the error handling
code here gives the impression they can so it has been removed.

Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5048
2016-09-01 11:39:45 -07:00
Gvozden Neskovic
ee36c709c3 Performance optimization of AVL tree comparator functions
perf: 2.75x faster ddt_entry_compare()
    First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently.  Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old	6.85789 s
new	2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
    Compute the result directly instead of using conditionals

perf: zfs_range_compare()
    Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
    `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033
2016-08-31 14:35:34 -07:00
Hajo Möller
82ab6848cc Fix "zpool get guid,freeing,leaked" source
`zpool get guid,freeing,leaked` shows SOURCE as `default`, it should
be `-` as those props are not editable.

Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it
stays `ZPROP_SRC_NONE`.  Make src const to avoid future mistakes

Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4170
2016-08-30 15:57:15 -07:00
cao
8f50bafb04 Delete unused zfsctl_snapdir_inactive declaration
zfsctl_snapdir_inactive is defined in zfs-0.6.3.  In zfs-0.6.5.7
this is declaration remains even though the implementation was
removed in commit 278bee93.  Removed fastreboot_disable_highpil
which is also unused.

Signed-off-by: caoxuewen cao.xuewen@zte.com.cn
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5042
2016-08-30 14:33:40 -07:00
Simon Klinkert
db707ad094 OpenZFS 6940 - Cannot unlink directories when over quota
From user perspective, I would expect that ZFS is always able
to remove files and directories even when the quota is exceeded.

Authored by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6940
OpenZFS-issue: https://www.illumos.org/issues/6334
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916
Closes #5044
2016-08-30 14:33:04 -07:00
Alexander Motin
755065f3dc OpenZFS 6322 - ZFS indirect block predictive prefetch
For quite some time I was thinking about possibility to prefetch
ZFS indirection tables while doing sequential reads or writes.
Recent changes in predictive prefetcher made that much easier to
do. My tests on zvol with 16KB block size on 5x striped and 2x
mirrored pool of 10 disks show almost double throughput on sequential
read, and almost tripple on sequential rewrite. While for read alike
effect can be received from increasing maximal prefetch distance
(though at higher memory cost), for rewrite there is no other
solution so far.

Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413
Closes #5040

Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
  to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'

- Difference from upstream in module/zfs/dmu_zfetch.c,
  uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance

- Variables have been initialized at the beginning of the function
 (void dmu_zfetch) to resemble the order of occurrence and account
 for C99, C11 mode errors.
2016-08-30 14:26:55 -07:00
Matthew Ahrens
98ace739bd OpenZFS 7086 - ztest attempts dva_get_dsize_sync on an embedded blockpointer
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at
the db_blkptr, to prevent it from being changed by syncing context.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7086
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317
Closes #5039
2016-08-30 14:25:50 -07:00
GeLiXin
c40db193a5 Fix: Build warnings with different gcc optimization levels in debug mode
This fix resolves warnings reported during compiling with different gcc
optimization levels in debug mode,

Test tools:
gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago)

List of warnings:
CFLAGS=-O1 ./configure --enable-debug ;make
../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’:
../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function
../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here

CFLAGS=-Os ./configure --enable-debug ; make
libzfs_dataset.c: In function ‘zfs_prop_set_list’:
libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5022
2016-08-29 12:46:18 -07:00
GeLiXin
9907cc1cc8 Add zfs_arc_meta_limit_percent tunable
ARC will evict meta buffers that exceed the arc_meta_limit. Before a further
investigating on whether we should take special protection on meta buffers,
this tunable make arc_meta_limit adjustable for different workloads.

People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko,
so some range check is added to guarantee a suitable arc_meta_limit.

Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4957
2016-08-23 13:03:01 -07:00
Tim Chase
3e635ac15c Prevent reclaim in send_traverse_thread()
As is the case with traverse_prefetch_thread(), the deep stacks caused
by traversal require disabling reclaim in the send traverse thread.

Also, do the same for receive_writer_thread() in which similar problems
have been observed.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4912
Closes #4998
2016-08-22 16:12:05 -07:00
Gvozden Neskovic
9cc1844a1d Linux compat: Grsecurity kernel
API Change: Module parameter set/get methods take const parameter in
Grsecurity kernel v4.7.1

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4997
Closes #5001
2016-08-22 10:05:45 -07:00
Matthew Ahrens
2bce8049c3 OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109
Closes #4641
Closes #4972
2016-08-19 12:48:03 -07:00
Matthew Ahrens
8bea981504 OpenZFS 7003 - zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7003
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108
Closes #4972
2016-08-19 12:35:23 -07:00
heary-cao
ee6370a7a4 Fix spa config generate memory leak in spa_load_best function
When spa retry load succeeds and spa recovery is requested it may
leak in spa_load_best function.  Always free the generated config
when it is not assigned to the spa.

Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4940
2016-08-19 11:17:12 -07:00
GeLiXin
aeb9baa618 Fix: handle NULL case in spl_kmem_free_track()
When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all
the buffers alloced by kmem_alloc() and kmem_zalloc().  If a NULL
pointer which indicates no track info in SPL is passed to
spl_kmem_free_track, we just ignore it.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4967
Closes #567
2016-08-19 09:14:24 -07:00
Paul Dagnelie
32d41fb73a OpenZFS 7176 - Yet another hole birth issue
This is another bug in the long line of hole-birth related issues. In
this particular case, it was discovered that a previous hole-birth fix
(illumos bug 6513, commit bc77ba73) did not cover as many cases as we
thought it did. While the issue worked in the case of hole-punching
(writing zeroes to a large part of a file), it did not deal with
truncation, and then writing beyond the new end of the file.

The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7176
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad
Upstream-bugs: DLPX-46009

Porting notes:
- Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ;
  keep previous position of the initialization
2016-08-18 09:26:44 -07:00
Matthew Ahrens
d9eea113f8 It is not necessary to zero struct dbuf_hold_impl_data
Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a
considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to
`kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`,
which is around 2KiB.  This structure is used as a stack, to limit the size of
the C stack as dbuf_hold() calls itself recursively.  We make a recursive call
to hold the parent's dbuf when the requested dbuf is not found.  The vast
majority of the time, the parent or grandparent indirect dbuf is cached, so the
number of recursive calls is very low.  However, we initialize this entire
array for every call to dbuf_hold().

To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()`
instead of `kmem_zalloc()`.  __dbuf_hold_impl_init is changed to initialize all
members of the struct before they are used.  I observed ~5% performance
improvement on a workload which creates many files.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4974
2016-08-16 15:27:17 -07:00
Gvozden Neskovic
fc897b24b2 Rework of fletcher_4 module
- Benchmark memory block is increased to 128kiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset
overhead of time function.

- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()`
are introduced.

- Implementation mutex lock is replaced by atomic variable.

- To save time, benchmark is not executed in userspace. Instead, highest supported
implementation is used for fastest. Default userspace selector is still 'cycle'.

- `fletcher_4_native/byteswap()` methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).

- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.

- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.

Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:
implementation   native         byteswap
scalar           4768120823     3426105750
sse2             7947841777     4318964249
ssse3            7951922722     6112191941
avx2             13269714358    11043200912
fastest          avx2           avx2

Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:
implementation   native         byteswap
scalar           1291115967     1031555336
sse2             2539571138     1280970926
ssse3            2537778746     1080016762
avx2             4950749767     1078493449
avx512f          9581379998     4010029046
fastest          avx512f        avx512f

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4952
2016-08-16 14:11:55 -07:00
Gvozden Neskovic
70b258fc96 Fletcher4 implementation using avx512f instruction set
Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per
loop iteration. Size alignment of main fletcher4 methods is adjusted
accordingly. New implementation is called 'avx512f'.

Note: byteswap method can be implemented more efficiently when avx512bw hardware
becomes available. Currently, it is ~ 2x slower than native method.

Table shows result of full (native) fletcher4 calculation for different buffer size:

fletcher4   4KB     16KB    64KB    128KB   256KB   1MB     16MB
--------------------------------------------------------------------
[scalar]    1213    1228    1231    1231    1225    1200    1160
[sse2]      2374    2442    2459    2456    2462    2250    2220
[avx2]      4288    4753    4871    4893    4900    4050    3882
[avx512f]   5975    8445    9196    9221    9262    6307    5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
2016-08-16 14:11:14 -07:00
Rich Ercolani
6d836e6f8b Add tunable to ignore hole_birth
Adds a module option which disables the hole_birth optimization
which has been responsible for several recent bugs, including
issue #4050.

Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4833
2016-08-15 09:52:56 -07:00
GeLiXin
e35c5a8265 Fix incorrect pool state after import
Import a raidz pool which has a vdev with a bad label, zpool status
shows the right state of the dev, but the wrong state of the pool.
The pool state should be DEGRADED, not ONLINE.

We examine the label in vdev_validate while in spa_load_impl, the bad
label can be detected but doesn't propagate its state to the parent.
There are other chances to propagate state in the following vdev_load
if we failed to load DTL, but our pool is raidz1 which can tolerate a
faulted disk.  So we lost the last chance to correct the pool state.

Propagate the leaf vdev's state to parent if its label was corrupted,
as is done elsewhere in vdev_validate.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #4948
2016-08-12 13:46:51 -07:00
Hans Rosenfeld
fb390aafc8 OpenZFS 5997 - FRU field not set during pool creation and never updated
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283

Porting Notes:

In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.

Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux.  In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.

The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.

  ereport.fs.zfs.config.sync         sysevent.fs.zfs.config_sync
  ereport.fs.zfs.zpool.destroy       sysevent.fs.zfs.pool_destroy
  ereport.fs.zfs.zpool.reguid        sysevent.fs.zfs.pool_reguid
  ereport.fs.zfs.vdev.remove         sysevent.fs.zfs.vdev_remove
  ereport.fs.zfs.vdev.clear          sysevent.fs.zfs.vdev_clear
  ereport.fs.zfs.vdev.check          sysevent.fs.zfs.vdev_check
  ereport.fs.zfs.vdev.spare          sysevent.fs.zfs.vdev_spare
  ereport.fs.zfs.vdev.autoexpand     sysevent.fs.zfs.vdev_autoexpand
  ereport.fs.zfs.resilver.start      sysevent.fs.zfs.resilver_start
  ereport.fs.zfs.resilver.finish     sysevent.fs.zfs.resilver_finish
  ereport.fs.zfs.scrub.start         sysevent.fs.zfs.scrub_start
  ereport.fs.zfs.scrub.finish        sysevent.fs.zfs.scrub_finish
  ereport.fs.zfs.bootfs.vdev.attach  sysevent.fs.zfs.bootfs_vdev_attach
2016-08-12 13:06:48 -07:00
Jason Zaman
a3600a106d icp: mark asm files with noexec stack
If there is no explicit note in the .S files, the obj file will mark it
as requiring an executable stack. This is unneeded and causes issues on
hardened systems.

More info:
https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart

Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4947
Closes #4962
2016-08-12 09:51:26 -07:00
Jason Zaman
a9947ce771 icp: add no_const for PaX Compat
The constify plugin will automatically constify a class of types that contain
only function pointers. The icp structs fail to build if this is enabled with
the following error. The no_const attribute makes the plugin skip those
structs.

module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’:
module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘*dst_ops->cou.cou_v1.co_control_ops’
  *((dst)->ops) = *((src)->ops);
                ^
module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’
  KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops);
  ^

Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4947
Closes #4962
2016-08-12 09:51:19 -07:00
Matthew Ahrens
169ab07cc8 OpenZFS 7263 - deeply nested nvlist can overflow stack
nvlist_pack() and nvlist_unpack are implemented recursively, which can
cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist
which contains an nvlist, which contains an nvlist, which...

Unprivileged users can pass an nvlist to the kernel via certain ioctls
on /dev/zfs, which the kernel will unpack without additional permission
checking or validation. Therefore, an unprivileged user can cause the
kernel's stack to overflow and panic.

Ideally, these functions would be implemented non-recursively. As a
quick fix, this patch limits the depth of the recursion and returns an
error when attempting to pack and unpack a deeply-nested nvlist.

Signed-off-by: Adam Leventhal <ahl@delphix.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7263
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d

-
2016-08-11 15:58:03 -07:00
Chen Haiquan
d9c97ec08b Use file_dentry and file_inode wrappers
Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

This problem crashes system when use zfs as a layer of overlayfs.

Signed-off-by: Chen Haiquan <oc@yunify.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4914
Closes #4935
2016-08-11 12:06:37 -07:00
GeLiXin
d5884c3453 Fix indefinite article
The indefinite article before nvlist should be "an", not "a".

We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should
stay the same as we are such a strict filesystem.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4941
2016-08-11 11:23:49 -07:00
Brian Behlendorf
e5fe9ddeec Remove custom root pool import code
Non-Linux OpenZFS implementations require additional support to be
used a root pool.  This code should simply be removed to avoid
confusion and improve readability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
2016-08-11 11:19:34 -07:00
Brian Behlendorf
cf41432c70 Linux 4.8 compat: Fix removal of bio->bi_rw member
All users of bio->bi_rw have been replaced with compatibility wrappers.
This allows the kernel specific logic to be abstracted away, and for
each of the supported cases to be documented with the wrapper.  The
updated interfaces are as follows:

* void blk_queue_set_write_cache(struct request_queue *, bool, bool)
* boolean_t bio_is_flush(struct bio *)
* boolean_t bio_is_fua(struct bio *)
* boolean_t bio_is_discard(struct bio *)
* boolean_t bio_is_secure_erase(struct bio *)
* VDEV_WRITE_FLUSH_FUA

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
2016-08-11 11:19:34 -07:00
Gvozden Neskovic
689f093ebc Build user-space with different gcc optimization levels
This fix resolves warnings reported during compiling of user-space
libraries with different gcc optimization levels.

Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora).
The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast.

List of warnings:

[GCC 4.9.2 -Os]
libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized]

[GCC 4.9.2 -Og]
fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized]

[GCC 4.9.2 -Ofast]
u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

[GCC 6.1.1]
Resolved in PR #4907

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4937
2016-08-09 14:40:35 -07:00
Chunwei Chen
afb6c031e8 Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer
Starting from Linux 4.7, get_acl will set acl cache pointer to temporary
sentinel value before calling i_op->get_acl. Therefore we can't compare
against ACL_NOT_CACHED and return.

Since from Linux 3.14, get_acl already check the cache for us, so we
disable this in zpl_get_acl.

Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4944
Closes #4946
2016-08-09 10:03:04 -07:00
Brian Behlendorf
4b908d3220 Linux 4.8 compat: posix_acl_valid()
The posix_acl_valid() function has been updated to require a
user namespace.  Filesystem callers should normally provide the
user_ns from the super block associcated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4922
2016-08-08 11:46:40 -07:00
Brian Behlendorf
e85a6396b0 Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING
Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING
configure checks, all supported kernel provide this functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4922
2016-08-08 11:46:32 -07:00
Nikolay Borisov
64aefee1b8 Fix interaction between userns uid/gid and SA
* When the uid/gid change is handled in zfs_setattr we want to
actually adjust the user passed uid to a KUID and write that to disk.

* In trace points use the i_uid member without doing translation,
since it has already been performed.

* Use kuid in zfs_aclset_common

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4928
2016-08-08 10:47:43 -07:00
Gaurav Kumar
cf2731e65b arc_meta_limit should be updated when arc_max is changed.
When arc_max is increased, arc_meta_limit will not be updated to 3/4
of the new arc_c_max value.  This was done originally to preserve any
existing maximum value.  This turned out to be counter intuitive to
users and this fix changes that behavior.  If zfs_arc_meta_limit is
non-default, it will be picked up later in the ARC tuning function.

Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4893
2016-08-02 13:43:36 -07:00
Brian Behlendorf
efe7978d89 Fix gcc self-comparison warning
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is
detected by gcc in metaslab_alloc().  Resolve the warning by passing
a physical size of 0 to BP_SET_BIRTH() as it done by other callers.

  module/zfs/metaslab.c: In function ‘metaslab_alloc’:
  module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates
      to true [-Werror=tautological-compare]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4907
2016-08-02 13:14:18 -07:00
Tony Hutter
4eb0db42d3 Fix possible VDEV stats array overflow
Fix a possible VDEV statistics array overflow when ZIOs with
ZIO_PRIORITY_NOW complete.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4883
Closes #4917
2016-08-02 08:45:24 -07:00
Nikolay Borisov
ba2fe6affb Move assignment of i_blkbits field
Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time
zfs_inode_update_impl is called. Since this value never changes
move its assignment to at inode creation time.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4906
2016-07-29 15:34:12 -07:00
Nikolay Borisov
e334e828a6 Unify license of icp module with the rest of zfs
The newly added icp module uses a hardcoded value of CDDL for the license,
however in local development one might want to change that to something
else in order to facilitate compiling against lock debugging enabled kernel.
All modules of the zfs use the ZFS_META_LICNSE string which is replaced with
the value held in the META file. One can modify the value in the META file
once and then rerun the configure to have all modules' licenses changed.

Change the icp module license string to be ZFS_META_LICENSE so that it
falls under the same paradigm.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4905
2016-07-29 15:34:12 -07:00
heary-cao
9f3d1407dc Fix zfs_allow_log_destroy() NULL dereference
In zfs_ioc_log_history() function the tsd_set() function is called
with NULL which causes the zfs_allow_log_destroy() to be run.  In
this case the passed value will be NULL.  This is normally entirely
safe because strfree() maps directly to kfree() which may be passed
a NULL.  However, since alternate implementations of strfree() may
not handle this gracefully add a check for NULL.

Observed under an embedded Linux 2.6.32.41 kernel running the
automated testing while running the ZFS Test Suite.

Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4872
2016-07-29 15:34:12 -07:00
Chunwei Chen
3b86aeb295 Linux 4.8 compat: REQ_OP and bio_set_op_attrs()
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags.  This included
changing the encoding of bi_rw.  It places REQ_OP_* in high order
bits and other stuff in low order bits.  This encoding is done
through the new helper function bio_set_op_attrs.  For complete
details refer to:

https://github.com/torvalds/linux/commit/f215082
https://github.com/torvalds/linux/commit/4e1b2d5

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4892
Closes #4899
2016-07-29 14:48:19 -07:00
Brian Behlendorf
bbb1b6cea7 Linux 4.8 compat: submit_bio()
The rw argument has been removed from submit_bio/submit_bio_wait.
Callers are now expected to set bio->bi_rw instead of passing it
in.  See https://github.com/torvalds/linux/commit/4e49ea4a for
complete details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
2016-07-29 14:48:00 -07:00
Brian Behlendorf
b7c7008ba2 Linux 4.8 compat: rw_semaphore atomic_long_t count
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type.  A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro.  See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #563
2016-07-29 14:17:53 -07:00
Richard Yao
f26b4b3c8a txg visibility code should not execute under tc_open_lock
The memory allocation and locking in `spa_txg_history_*()` can
potentially block txg_hold_open for arbitrarily long periods of time.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4333
2016-07-27 14:11:13 -07:00
Brian Behlendorf
fcf64f45d9 Fix zdb crash with 4K-only devices
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #4479
2016-07-27 13:38:46 -07:00
Colin Ian King
bf18fd89f9 void integer overflow on computation of refquota_slack
DMU_MAX_ACCESS should be cast to a uint64_t otherwise the
multiplication of DMU_MAX_ACCESS with spa_asize_inflation will
be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS
is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will
lead to an overflow.

Found by static analysis with CoverityScan 0.8.5

CID 150942 (#1 of 1): Unintentional integer overflow
  (OVERFLOW_BEFORE_WIDEN)
overflow_before_widen: Potentially overflowing expression
  67108864 * spa_asize_inflation with type int (32 bits, signed)
  is evaluated using 32-bit arithmetic, and then used in a context
  that expects an expression of type uint64_t (64 bits, unsigned).

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4889
2016-07-27 13:38:46 -07:00
Gvozden Neskovic
a64f903b06 Fixes for issues found with cppcheck tool
The patch fixes small number of errors/false positives reported by `cppcheck`,
static analysis tool for C/C++.

cppcheck 1.72

$ cppcheck . --force --quiet
[cmd/zfs/zfs_main.c:4444]: (error) Possible null pointer dereference: who_perm
[cmd/zfs/zfs_main.c:4445]: (error) Possible null pointer dereference: who_perm
[cmd/zfs/zfs_main.c:4446]: (error) Possible null pointer dereference: who_perm
[cmd/zpool/zpool_iter.c:317]: (error) Uninitialized variable: nvroot
[cmd/zpool/zpool_vdev.c:1526]: (error) Memory leak: child
[lib/libefi/rdwr_efi.c:1118]: (error) Memory leak: efi_label
[lib/libuutil/uu_misc.c:207]: (error) va_list 'args' was opened but not closed by va_end().
[lib/libzfs/libzfs_import.c:1554]: (error) Dangerous usage of 'diskname' (strncpy doesn't always null-terminate it).
[lib/libzfs/libzfs_sendrecv.c:3279]: (error) Dereferencing 'cp' after it is deallocated / released
[tests/zfs-tests/cmd/file_write/file_write.c:154]: (error) Possible null pointer dereference: operation
[tests/zfs-tests/cmd/randfree_file/randfree_file.c:90]: (error) Memory leak: buf
[cmd/zinject/zinject.c:1068]: (error) Uninitialized variable: dataset
[module/icp/io/sha2_mod.c:698]: (error) Uninitialized variable: blocks_per_int64

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1392
2016-07-27 13:31:22 -07:00
Tim Chase
25458cbef9 Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_mata_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made.  The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
2016-07-25 15:26:38 -07:00
Tim Chase
e6603b7c1f Fix sync behavior for disk vdevs
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858
2016-07-25 14:24:47 -07:00
Brian Behlendorf
273ff9b5cc Fix uninitialized variable in avl_add()
Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-07-25 14:21:34 -07:00
Nikolay Borisov
2c6abf15ff Remove znode's z_uid/z_gid member
Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
2016-07-25 13:21:49 -07:00
Tom Caputi
77943bc1dc Fix for metaslab_fastwrite_unmark() assert failure
Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4267
2016-07-25 13:21:43 -07:00
Tom Caputi
f4bc1bbe11 Fix for compilation error when using the kernel's CONFIG_LOCKDEP
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
2016-07-21 18:49:50 -07:00
Tom Caputi
0b04990a5d Illumos Crypto Port module added to enable native encryption in zfs
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
2016-07-20 10:43:30 -07:00
Chunwei Chen
be88e733a6 Fix NULL pointer in zfs_preumount from 1d9b3bd
When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into
zfs_preumount.

In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4867
Issue #4854
2016-07-20 09:05:43 -07:00
Gvozden Neskovic
26a08b5ca9 RAIDZ parity kstat rework
Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4860
2016-07-19 16:43:07 -07:00
Gvozden Neskovic
c9187d867f Fixes and enhancements of SIMD raidz parity
- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
2016-07-19 16:43:07 -07:00
Chunwei Chen
1d9b3bd8fb Wait iput_async before evict_inodes to prevent race
Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes.  This means it could
destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4854
2016-07-19 09:23:58 -07:00
Tyler J. Stachecki
3d11ecbddd Prevent segfaults in SSE optimized Fletcher-4
In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4862
2016-07-19 09:03:44 -07:00
Roman Strashkin
1b87e0f532 Fix filesystem destroy with receive_resume_token
It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receieve'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.   If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4818
2016-07-15 15:34:46 -07:00
Tyler J. Stachecki
35a76a0366 Implementation of SSE optimized Fletcher-4
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789
2016-07-15 10:42:35 -07:00
Chris Dunlop
dfbc86309f Use native inode->i_nlink instead of znode->z_links
A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227
2016-07-14 16:25:34 -07:00
Chunwei Chen
02de3e3c5d Fix dbuf_stats_hash_table_data race
Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4846
2016-07-14 16:25:04 -07:00
Tim Chase
8887c7d778 Prevent null dereferences when accessing dbuf kstat
In arc_buf_info(), the arc_buf_t may have no header.  If not, don't try
to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4837
2016-07-14 16:25:03 -07:00
Gvozden Neskovic
ae25d22235 Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.
The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783
2016-07-13 10:24:55 -07:00
Gvozden Neskovic
1bf3bf0e29 Fix handling of errors nvlist in zfs_ioc_recv_new()
zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4829
2016-07-13 10:20:08 -07:00
Peng
81edd3e834 Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937
2016-07-12 16:47:44 -07:00
Chunwei Chen
31b6111fd9 Kill zp->z_xattr_parent to prevent pinning
zp->z_xattr_parent will pin the parent. This will cause huge issue
when unlink a file with xattr. Because the unlinked file is pinned, it
will never get purged immediately. And because of that, the xattr
stuff will never be marked as unlinked. So the whole unlinked stuff
will stay there until shrink cache or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
2016-07-12 14:18:10 -07:00
Chunwei Chen
ddae16a9cf xattr dir doesn't get purged during iput
We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
2016-07-12 14:04:30 -07:00
Chunwei Chen
6c2530647c fh_to_dentry should return ESTALE when generation mismatch
When generation mismatch, it usually means the file pointed by the file handle
was deleted. We should return ESTALE to indicate this. We return ENOENT in
zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
2016-07-12 13:34:15 -07:00
Chunwei Chen
bffb68a2b8 Fix Large kmem_alloc in vdev_metaslab_init
This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4752
2016-07-12 13:34:15 -07:00
Chunwei Chen
7938c2aca7 Don't allow accessing XATTR via export handle
Allow accessing XATTR through export handle is a very bad idea. It
would allow user to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
2016-07-12 13:34:14 -07:00
Chunwei Chen
061460dfe2 Fix get_zfs_sb race with concurrent umount
Certain ioctl operations will call get_zfs_sb, which will holds an active
count on sb without checking whether it's active or not. This will result
in use-after-free. We fix this by using atomic_inc_not_zero to make sure
we got an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-07-12 13:34:14 -07:00
Gvozden Neskovic
590c9a0994 Allow building with CFLAGS="-O0"
If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which
does not trigger this warning.

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4799
2016-07-11 16:53:02 -07:00
Brian Behlendorf
0dab2e84fc Vectorized fletcher_4 must be 128-bit aligned
The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iterations.  Fallback to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330
2016-06-29 11:22:22 -07:00
Paul Dagnelie
d1d19c7854 OpenZFS 6876 - Stack corruption after importing a pool with a too-long name
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e
2016-06-28 13:47:04 -07:00
Igor Kozhukhov
eca7b76001 OpenZFS 6314 - buffer overflow in dsl_dataset_name
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
2016-06-28 13:47:03 -07:00
Brian Behlendorf
43e52eddb1 Implement zfs_ioc_recv_new() for OpenZFS 2605
Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-06-28 13:47:03 -07:00
Dan McDonald
8c62a0d0f3 OpenZFS 6562 - Refquota on receive doesn't account for overage
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6
2016-06-28 13:47:03 -07:00
Dan McDonald
671c93546c OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad
2016-06-28 13:47:03 -07:00
Eli Rosenthal
f8866f8ae3 OpenZFS 6738 - zfs send stream padding needs documentation
Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff
2016-06-28 13:47:03 -07:00
Andrew Stormont
b607405fea OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b
2016-06-28 13:47:03 -07:00
Paul Dagnelie
e6d3a843d6 OpenZFS 6393 - zfs receive a full send as a clone
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e
2016-06-28 13:47:03 -07:00
Matthew Ahrens
47dfff3b86 OpenZFS 2605, 6980, 6902
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
2016-06-28 13:47:02 -07:00
Ned Bass
50c957f702 Implement large_dnode pool feature
Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-06-24 13:13:21 -07:00
Ned Bass
68cbd56e18 Backfill metadnode more intelligently
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-06-24 13:13:12 -07:00
smh
d14fa5dba1 FreeBSD rS271776 - Persist vdev_resilver_txg changes
Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <smh@FreeBSD.org>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790
2016-06-24 13:02:42 -07:00
Nav Ravindranath
784d15c14c OpenZFS 6878 - Add scrub completion info to "zpool history"
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Authored by: Nav Ravindranath <nav@delphix.com>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787
2016-06-24 13:01:31 -07:00
Paul Dagnelie
bc77ba73fe OpenZFS 6513 - partially filled holes lose birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>a
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.
2016-06-21 10:55:13 -07:00
Chunwei Chen
100a91aa3e Fix NFS credential
The commit f74b821 caused a regression where creating file through NFS will
always create a file owned by root. This is because the patch enables the KSID
code in zfs_acl_ids_create, which it would use euid and egid of the current
process. However, on Linux, we should use fsuid and fsgid for file operations,
which is the original behaviour. So we revert this part of code.

The patch also enables secpolicy_vnode_*, since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4772
Closes #4758
2016-06-21 09:58:37 -07:00
Gvozden Neskovic
ab9f4b0b82 SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-06-21 09:27:26 -07:00
Tim Chase
09fb30e5e9 Linux 4.6 compat: Fall back to d_prune_aliases() if necessary
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliaes() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: #4726
2016-06-17 13:33:49 -07:00
Brian Behlendorf
f74b821a66 Add zfs allow and zfs unallow support
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487
2016-06-07 09:16:52 -07:00
Jinshan Xiong
1eeb4562a7 Implementation of AVX2 optimized Fletcher-4
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330
2016-06-02 14:30:51 -07:00
Jinshan Xiong
16fc1ec3ba Improve spl slab cache alloc
The policy is to try to allocate with KM_NOSLEEP, which will lead to
memory allocation with GFP_ATOMIC, and if it fails, it will launch
an taskq to expand slab space.

This way it should be able to get better NUMA memory locality and
reduce the overhead of context switch.

Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #551
2016-06-01 10:26:42 -07:00
Chunwei Chen
6a79672530 Fix memleak in vdev_config_generate_stats
fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4707
Issue #4708
2016-05-31 16:05:21 -07:00
Chunwei Chen
06ee0031a6 Fix memleak in zpl_parse_options
strsep() will advance tmp_mntopts, and will change it to NULL on last
iteration.  This will cause strfree(tmp_mntopts) to not free anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4706
Issue #4708
2016-05-31 16:04:26 -07:00
Chunwei Chen
540c392793 Fix out-of-bound access in zfs_fillpage
The original code will do an out-of-bound access on pl[] during last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4705
Issue #4708
2016-05-31 16:01:27 -07:00
Chunwei Chen
ea5f1a200b Fix use-after-free in splat_taskq_test7
This splat_vprint is using tq_arg->name after tq_arg is freed.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #557
2016-05-31 11:58:42 -07:00
Chunwei Chen
f58040c0fc Implement a proper rw_tryupgrade
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.

This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes #554
2016-05-31 11:44:15 -07:00
GeLiXin
b7faa7aabd Fix self-healing IO prior to dsl_pool_init() completion
Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4652
2016-05-27 14:11:25 -07:00
Tony Hutter
26ef0cc7db OpenZFS 6531 - Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
2016-05-26 10:11:51 -07:00
Tony Hutter
7e945072d1 Add request size histograms (-r) to zpool iostat, minor man page fix
Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659
2016-05-25 15:49:35 -07:00
Chunwei Chen
4442f60d8e Fix arc_prune_task use-after-free
arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by force the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4687
Closes #4690
2016-05-25 14:11:53 -07:00
Chunwei Chen
b3a22a0a00 Fix taskq_wait_outstanding re-evaluate tq_next_id
wait_event is a macro, so the current implementation will cause re-
evaluation of tq_next_id every time it wakes up. This would cause
taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
2016-05-24 13:02:10 -07:00
Chunwei Chen
5ce028b0d4 Fix race between taskq_destroy and dynamic spawning thread
While taskq_destroy would wait for dynamic_taskq to finish its tasks, but it
does not implies the thread being spawned is up and running. This will cause
taskq to be freed before the thread can exit.

We fix this by using tq_nspawn to indicate how many threads are being spawned
before they are inserted to the thread list. And have taskq_destroy to wait
for it to drop to zero.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Closes #550
2016-05-24 13:00:17 -07:00
Chunwei Chen
872e0cc9c7 Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires
In 39cd90e, I mistakenly disabled the ability of using absolute expire time in
cv_timedwait_hires. I don't quite sure why I did that, so let's restore it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
2016-05-24 12:58:49 -07:00
Chunwei Chen
cbecb4fb22 Skip ctldir znode in zfs_rezget to fix snapdir issues
Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
automount it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4514
Closes #4661
Closes #4672
2016-05-23 11:06:56 -07:00
Chunwei Chen
9baaa7deae Linux 4.7 compat: use iterate_shared for concurrent readdir
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.

Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4664
Closes #4665
2016-05-20 11:09:16 -07:00
Chunwei Chen
68e8f59afb Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4665
2016-05-20 11:08:55 -07:00
Chunwei Chen
fdbc1ba99d Linux 4.7 compat: inode_lock() and friends
Linux 4.7 changes i_mutex to i_rwsem, and we should used inode_lock and
inode_lock_shared to do exclusive and shared lock respectively.

We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernel you'll always take an exclusive lock.

We also add all other inode_lock friends. And nested users now should
explicitly call spl_inode_lock_nested with correct subclass.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes #549
2016-05-20 11:00:14 -07:00
Nikolay Borisov
278f223668 Kill znode->z_gen field
This field is a duplicate of the inode->i_generation, so just
kill it.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4538
Closes #4654
2016-05-19 13:06:14 -07:00
Brian Behlendorf
ada8258141 Revert "zhack: Add 'feature disable' command"
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:52:07 -07:00
Marcel Huber
2587cd8f93 Fixes subtle bug in zio_handle_io_delay()
Fixed bug introduced in commit #c35b1882.  Hinted by gcc:

zio_inject.c: In function ‘zio_handle_io_delay’:
zio_inject.c:382:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
   if (handler->zi_record.zi_freq != 0 &&
      ^~
      zio_inject.c:384:4: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
          continue;
	      ^~~~~~~~

Signed-off-by: Marcel Huber <marcelhuberfoo@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4632
2016-05-17 11:03:11 -07:00
Brian Behlendorf
8302528617 zhack: Add 'feature disable' command
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:00:21 -07:00
Chunwei Chen
d88895a069 Remove dummy znode from zvol_state
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.

In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.

We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t.  This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4510
2016-05-17 10:29:02 -07:00
Chunwei Chen
a9bb2b6827 Use cv_timedwait_sig_hires in arc_reclaim_thread
The was originally using interruptible cv_timedwait_sig, but was changed
to uninterruptible cv_timedwait_hires in ae6d0c6. Use _sig_hires instead
to allow interruptible sleep.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4633
Closes #4634
2016-05-12 14:56:47 -07:00
Chunwei Chen
39cd90ef08 Add cv_timedwait_sig_hires to allow interruptible sleep
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #548
2016-05-12 14:54:15 -07:00
Dan McDonald
8adb798aa5 OpenZFS 6093 - zfsctl_shares_lookup
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6093
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0f92170
Closes #4630

This function was always implemented slightly differently under Linux
and therefore never suffered from this issue.  The patch has been
updated and applied as cleanup in order to minimize differences with
the upstream OpenZFS code.
2016-05-12 13:39:06 -07:00
Brian Behlendorf
c15706490e Revert "Kill znode->z_gen field"
This reverts commit 4cd77889b6.  The
i_generation field in the inode is 32-bit and the SA code expects
64-bit fixed values.  Revert this optimization for now until
this is cleanly addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4538
2016-05-12 13:36:22 -07:00
Tony Hutter
193a37cb24 Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues
Update the zfs module to collect statistics on average latencies, queue sizes,
and keep an internal histogram of all IO latencies.  Along with this, update
"zpool iostat" with some new options to print out the stats:

-l: Include average IO latencies stats:

 total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
 read  write   read  write   read  write   read  write   wait
-----  -----  -----  -----  -----  -----  -----  -----  -----
    -   41ms      -    2ms      -   46ms      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -      -      -      -      -      -      -      -      -
    -   49ms      -    2ms      -   47ms      -      -      -
    -      -      -      -      -      -      -      -      -
    -    2ms      -    1ms      -      -      -    1ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  2ms    1ms    2ms  412us   26us   25us      -    5ms      -
    -    1ms      -  413us      -   25us      -    5ms      -
    -    1ms      -  460us      -   29us      -    5ms      -
196us    1ms  196us  370us    7us   23us      -    5ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----

-w: Print out latency histograms:

sdb           total           disk         sync_queue      async_queue
latency    read   write    read   write    read   write    read   write   scrub
-------  ------  ------  ------  ------  ------  ------  ------  ------  ------
1ns           0       0       0       0       0       0       0       0       0
...
33us          0       0       0       0       0       0       0       0       0
66us          0       0     107    2486       2     788      12      12       0
131us         2     797     359    4499      10     558     184     184       6
262us        22     801     264    1563      10     286     287     287      24
524us        87     575      71   52086      15    1063     136     136      92
1ms         152    1190       5   41292       4    1693     252     252     141
2ms         245    2018       0   50007       0    2322     371     371     220
4ms         189    7455      22  162957       0    3912    6726    6726     199
8ms         108    9461       0  102320       0    5775    2526    2526      86
17ms         23   11287       0   37142       0    8043    1813    1813      19
34ms          0   14725       0   24015       0   11732    3071    3071       0
67ms          0   23597       0    7914       0   18113    5025    5025       0
134ms         0   33798       0     254       0   25755    7326    7326       0
268ms         0   51780       0      12       0   41593   10002   10002       0
537ms         0   77808       0       0       0   64255   13120   13120       0
1s            0  105281       0       0       0   83805   20841   20841       0
2s            0   88248       0       0       0   73772   14006   14006       0
4s            0   47266       0       0       0   29783   17176   17176       0
9s            0   10460       0       0       0    4130    6295    6295       0
17s           0       0       0       0       0       0       0       0       0
34s           0       0       0       0       0       0       0       0       0
69s           0       0       0       0       0       0       0       0       0
137s          0       0       0       0       0       0       0       0       0
-------------------------------------------------------------------------------

-h: Help

-H: Scripted mode. Do not display headers, and separate fields by a single
    tab instead of arbitrary space.

-q: Include current number of entries in sync & async read/write queues,
    and scrub queue:

 syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
 pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0    227    394      0     19      0      0      0      0
    0      0    227    394      0     19      0      0      0      0
    0      0    108     98      0     19      0      0      0      0
    0      0     19     98      0      0      0      0      0      0
    0      0     78     98      0      0      0      0      0      0
    0      0     19     88      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----

-p: Display numbers in parseable (exact) values.

Also, update iostat syntax to allow the user to specify specific vdevs
to show statistics for.  The three options for choosing pools/vdevs are:

Display a list of pools:
    zpool iostat ... [pool ...]

Display a list of vdevs from a specific pool:
    zpool iostat ... [pool vdev ...]

Display a list of vdevs from any pools:
    zpool iostat ... [vdev ...]

Lastly, allow zpool command "interval" value to be floating point:
    zpool iostat -v 0.5

Signed-off-by: Tony Hutter <hutter2@llnl.gov
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4433
2016-05-12 12:36:32 -07:00
Nikolay Borisov
04bc461062 Reduce stack usage of dmu_recv_stream function
The receive_writer_arg and receive_arg structures become large
when ZFS is compiled with debugging enabled. This results in
gcc throwing an error about excessive stack usage:

  module/zfs/dmu_send.c: In function ‘dmu_recv_stream’:
  module/zfs/dmu_send.c:2502:1: error: the frame size of 1256 bytes is
  larger than 1024 bytes [-Werror=frame-larger-than=]

Fix this by allocating those functions on the heap, rather than
on the stack.

With patch:    dmu_send.c:2350:1:dmu_recv_stream 240 static
Without patch: dmu_send.c:2350:1:dmu_recv_stream 1336 static

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4620
2016-05-11 12:14:24 -07:00
Chunwei Chen
32c8c946ea OpenZFS 6842 - Fix empty xattr dir causing lockup
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

An initial version of this patch was applied in commit 29572cc and
subsequently refined upstream.  Since the implementations do not
conflict with each other both are left applied for now.

OpenZFS-issue: https://www.illumos.org/issues/6842
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/02525cd
Closes #4615
2016-05-10 10:38:21 -07:00
Brian Behlendorf
33cf67cd9a Wrap vdev_count_verify_zaps() with ZFS_DEBUG
Commit e0ab3ab introduced two blocks of code which are only needed
when debugging is enabled.  These blocks should be wrapped with
ZFS_DEBUG for clarity and to prevent unused variable warnings in
a production build.

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4515
2016-05-06 18:14:03 -07:00
David Quigley
ae6d0c601e OpenZFS 6672 - arc_reclaim_thread() should use gethrtime()
6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6672
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/571be5c
Closes #4600
2016-05-06 09:35:52 -07:00
Brian Behlendorf
4b2a3e0c9d OpenZFS 6286 - ZFS internal error when set large block on bootfs
6286 ZFS internal error when set large block on bootfs
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6286
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6de9bb5
Closes #4585
2016-05-05 16:19:12 -07:00
Joe Stein
e0ab3ab553 OpenZFS 6736 - ZFS per-vdev ZAPs
6736 ZFS per-vdev ZAPs
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6736
  https://github.com/openzfs/openzfs/commit/215198a

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4515
2016-05-02 14:27:45 -07:00
Nikolay Borisov
4cd77889b6 Kill znode->z_gen field
This field is a duplicate of the inode->i_generation, so just kill it

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4538
2016-05-02 11:22:31 -07:00
Tim Chase
ddab862d4c Enable PF_FSTRANS for ioctl secpolicy callbacks (#4571)
At the very least, the zfs_secpolicy_write_perms ioctl security policy
callback, which calls dsl_dataset_hold(), can require freeing memory and,
therefore, re-enter ZFS.  This patch enables PF_FSTRANS for all of the
security policy callbacks similarly to the manner in which it's enabled
for the actual ioctl callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4554
2016-05-02 10:00:50 -07:00
Vitaut Bajaryn
ef1c27117b module/.gitignore: Add *.dwo (#4580)
These files get generated when CONFIG_DEBUG_INFO_DWARF4 is enabled in
Linux .config.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4580
2016-05-02 09:07:04 -07:00
Alex Reece
463a8cfe2b Illumos 6844 - dnode_next_offset can detect fictional holes
6844 dnode_next_offset can detect fictional holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>

dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.

When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.

I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.

The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.

References:
  https://www.illumos.org/issues/6844
  https://github.com/openzfs/openzfs/pull/82
  DLPX-35372

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4548
2016-04-27 16:24:15 -07:00
Josef 'Jeff' Sipek
8a5fc74880 Illumos 6659 - nvlist_free(NULL) is a no-op
6659 nvlist_free(NULL) is a no-op
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6659
  https://github.com/illumos/illumos-gate/commit/aab83bb

Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4566
2016-04-27 15:58:23 -07:00
Tim Chase
ea2633ad26 Clear PF_FSTRANS over spl_filp_fallocate()
The problem described in 2a5d574 also applies to XFS's file or inode
fallocate method.  Both paths may trigger writeback and expose this
issue, see the full stack below.

When layered on XFS a warning will be emitted under CentOS7 when entering
either the file or inode fallocate method with PF_FSTRANS already set.
To avoid triggering this error PF_FSTRANS is cleared and then reset
in vn_space().

WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0

Call Trace:
 [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
 [<ffffffff81173ed7>] __writepage+0x17/0x40
 [<ffffffff81176f81>] write_cache_pages+0x251/0x530
 [<ffffffff811772b1>] generic_writepages+0x51/0x80
 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
 [<ffffffff81177300>] do_writepages+0x20/0x30
 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
 [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
 [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
 [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
 [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
 [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
 [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
 [<ffffffff810c1777>] kthread+0xd7/0xf0
 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes zfsonlinux/zfs#4529
2016-04-26 11:22:43 -07:00
Brian Behlendorf
2d82ea8b11 Use udev for partition detection
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks.  This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc.  Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.

Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable.  At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.

In order to address this the zpool_label_disk_wait() function
has been updated to use libudev.  Until the registered system
device acknowledges that it in fully initialized the function
will wait.  Once fully initialized all device links are checked
and allowed to settle for 50ms.  This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.

For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time.  In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked.  However, if the rare event it is needed
it will prevent a failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #4523
Closes #3708
Closes #4077
Closes #4144
Closes #4214
Closes #4517
2016-04-25 11:13:20 -07:00
Chunwei Chen
232604b58e Linux 4.5 compat: Use xattr_handler->name for acl
Linux 4.5 added member "name" to xattr_handler. xattr_handler which matches to
whole name rather than prefix should use "name" instead of "prefix".
Otherwise, kernel will return with EINVAL when it tries to resolve handlers.

Also, we remove the strcmp checks when xattr_handler has name, because
xattr_resolve_name will do the check for us.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4549
Closes #4537
2016-04-25 08:42:08 -07:00
Brian Behlendorf
da5e151f20 Add pn_alloc()/pn_free() functions
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented.  The existing illumos
implementation were used for this purpose.

The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions.  This removes a small point of divergence
between the ZoL code and upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4522
2016-04-21 09:49:25 -07:00
Ned Bass
98f03691a4 Fix ZPL miswrite of default POSIX ACL
Commit 4967a3e introduced a typo that caused the ZPL to store the
intended default ACL as an access ACL. Due to caching this problem
may not become visible until the filesystem is remounted or the inode
is evicted from the cache. Fix the typo and add a regression test.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4520
2016-04-18 11:26:55 -07:00
Colin Ian King
4903926f89 Fix inverted logic on none elevator comparison
Commit d1d7e2689d ("cstyle: Resolve C style issues") inverted
the logic on the none elevator comparison.  Fix this and make it
cstyle warning clean.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4507
2016-04-15 12:18:08 -07:00
Chunwei Chen
704cd0758a Enable lazytime semantic for atime
Linux 4.0 introduces lazytime. The idea is that when we update the atime, we
delay writing it to disk for as long as it is reasonably possible.

When lazytime is enabled, dirty_inode will be called with only I_DIRTY_TIME
flag whenever i_atime is updated. So under such condition, we will set
z_atime_dirty. We will only write it to disk if file is closed, inode is
evicted or setattr is called. Ideally, we should also write it whenever SA
is going to be updated, but it is left for future improvement.

There's one thing that we should take care of now that we allow i_atime to be
dirty. In original implementation, whenever SA is modified, zfs_inode_update
will be called to overwrite every thing in inode. This will cause dirty
i_atime to be discarded. We fix this by don't overwrite i_atime in
zfs_inode_update. We only overwrite i_atime when allocating new inode or doing
zfs_rezget with zfs_inode_update_new.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
2016-04-05 18:55:51 -07:00
Chunwei Chen
0df9673f01 Fix atime handling and relatime
The problem for atime:

We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its
handling is a mess. A huge part of mess regarding atime comes from
zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave
inconsistently with those three values.

zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you
don't pass ATTR_ATIME. Which means every write(2) operation which only updates
ctime and mtime will cause atime changes to not be written to disk.

Also zfs_inode_update from write(2) will replace inode->i_atime with what's
inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2).
You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.

Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's
inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new),
SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll
leave with a stale atime.

The problem for relatime:

We do have a relatime config inside ZFS dataset, but how it should interact
with the mount flag MS_RELATIME is not well defined. It seems it wanted
relatime mount option to override the dataset config by showing it as
temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would
also seems to want to override the mount option. Not to mention that
MS_RELATIME flag is actually never passed into ZFS, so it never really worked.

How Linux handles atime:

The Linux kernel actually handles atime completely in VFS, except for writing
it to disk. So if we remove the atime handling in ZFS, things would just work,
no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever
VFS updates the i_atime, it will notify the underlying filesystem via
sb->dirty_inode().

And also there's one thing to note about atime flags like MS_RELATIME and
other flags like MS_NODEV, etc. They are mount point flags rather than
filesystem(sb) flags. Since native linux filesystem can be mounted at multiple
places at the same time, they can all have different atime settings. So these
flags are never passed down to filesystem drivers.

What this patch tries to do:

We remove znode->z_atime, since we won't gain anything from it. We remove most
of the atime handling and leave it to VFS. The only thing we do with atime is
to write it when dirty_inode() or setattr() is called. We also add
file_accessed() in zpl_read() since it's not provided in vfs_read().

After this patch, only the MS_RELATIME flag will have effect. The setting in
dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME
set according to the setting in dataset in future patch.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
2016-04-05 18:54:55 -07:00
Brian Behlendorf
8b1899d3fb Linux 4.6 compat: PAGE_CACHE_SIZE removal
As described in torvalds/linux@4a2d057e the macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were originally introduced
to make it possible to add bigger chunks to the page cache.  This
never panned out and it has therefore been removed from the kernel.

ZFS has been updated to use the PAGE_{SIZE,SHIFT,MASK,ALIGN} macros
and calls to page_cache_release() have been replaced with put_page().

There was no need to introduce a configure check for this because
these interfaces have existed for a very long time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4489
2016-04-05 17:26:56 -07:00
Colin Ian King
f7b939bdbc Add support 32 bit FS_IOC32_{GET|SET}FLAGS compat ioctls
We need 32 bit userspace FS_IOC32_GETFLAGS and FS_IOC32_SETFLAGS
compat ioctls for systems such as powerpc64.  We use the normal
compat ioctl idiom as used by a variety of file systems to provide
this support.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4477
2016-03-31 17:56:12 -07:00
DHE
e3e7cf6026 gcc build error: -Wbool-compare in metaslab.c
When debugging is enabled on a very recent version of gcc
(tested with 5.3.0), DVA_SET_GANG(dva, !!(flags)) fails
because an assertion causes a comparison between what is
technically a boolean and an integer.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4465
2016-03-30 09:36:51 -07:00
Brian Behlendorf
c35b188246 Fix zpool_scrub_* test cases
The zpool_scrub_002, zpool_scrub_003, zpool_scrub_004 test cases fail
reliably when running against small pools or fast storage.  This
occurs because the scrub/resilver operation completes before subsequent
commands can be run.

A one second delay has been added to 10% of zio's in order to ensure
the scrub/resilver operation will run for at least several seconds.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4450
2016-03-30 09:30:34 -07:00
Tim Chase
7bb5d92de8 Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq
When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
thread to be spawned.  This will cause TQ_NOQUEUE to behave similarly
as it does with non-dynamic taskqs.

Add support for TQ_NOQUEUE to taskq_dispatch_ent().

Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #530
2016-03-17 09:52:35 -07:00
Alex Wilson
887d1e60ef Illumos 6681 - zfs list burning lots of time in dodefault() via dsl_prop_*
6681 zfs list burning lots of time in dodefault() via dsl_prop_*
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/6681
  https://github.com/illumos/illumos-gate/commit/d09e447

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4406
2016-03-15 18:46:44 -07:00
Paul Dagnelie
c352ec27d5 Illumos 6370 - ZFS send fails to transmit some holes
6370 ZFS send fails to transmit some holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6370
  https://github.com/illumos/illumos-gate/commit/286ef71

In certain circumstances, "zfs send -i" (incremental send) can produce
a stream which will result in incorrect sparse file contents on the
target.

The problem manifests as regions of the received file that should be
sparse (and read a zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).

Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).

Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).

We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).

The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).

Ported-by: Steven Burgess <sburgess@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4369
Closes #4050
2016-03-10 14:25:22 -08:00
Brian Behlendorf
a6ae97caed Add rw_tryupgrade()
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms.  It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held.  On other platforms the lock is never released during
the upgrade process.  This is necessary under Linux because the kernel
does not provide an upgrade function.

There are currently no callers in the ZFS code where this change in
behavior is a problem.  In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534
2016-03-10 13:05:25 -08:00
Boris Protopopov
1ee159f423 Fix lock order inversion with zvol_open()
zfsonlinux issue #3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3681
2016-03-10 09:53:36 -08:00
Boris Protopopov
a0bd735adb Add support for asynchronous zvol minor operations
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset

zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()

Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.

* Each taskq must be single threaded to ensure tasks are always
  processed in the order in which they were dispatched.

* There is a taskq per-pool in order to keep the pools independent.
  This way if one pool is suspended it will not impact another.

* The preferred location to dispatch a zvol minor task is a sync
  task.  In this context there is easy access to the spa_t and
  minimal error handling is required because the sync task must
  succeed.

Support for asynchronous zvol minor operations address issue #3681.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2217
Closes #3678
Closes #3681
2016-03-10 09:49:22 -08:00
Tim Chase
f764edf016 Change KM_SLEEP to TQ_SLEEP in spa_deadman()
Since they both evaluate to zero, this is a semi-cosmetic change
but the latter is the proper value to use as an argument to
taskq_dispatch_delay().

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4393
2016-03-09 10:41:31 -08:00
ab-oe
513168abd2 Make zvol update volsize operation synchronous.
There is a race condition when new transaction group is added
to dp->dp_dirty_datasets list by the zap_update in the zvol_update_volsize.
Meanwhile, before these dirty data are synchronized, the receive process
can cause that dmu_recv_end_sync is executed. Then finally dirty data
are going to be synchronized but the synchronization ends with the NULL
pointer dereference error.

Signed-off-by: ab-oe <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4116
2016-02-29 09:07:27 -08:00
smh
9f500936c8 FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay

References:
  https://github.com/freebsd/freebsd@5c7a6f5d
  https://github.com/freebsd/freebsd@31b7f68d
  https://github.com/freebsd/freebsd@e186f564

Performance Testing:
  https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141

Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
  collegue Andriy Gapon has been included. It applied perfectly, but
  added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4334
2016-02-26 11:24:35 -08:00
Brian Behlendorf
8a09d5fd46 Add l2arc_max_block_size tunable
Set a limit for the largest compressed block which can be written
to an L2ARC device.  By default this limit is set to 16M so there
is no change in behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4323
2016-02-25 09:44:00 -08:00
Boris Protopopov
5428dc51fb Make zvol minor functionality more robust
Close the race window in zvol_open() to prevent removal of
zvol_state in the 'first open' code path. Move the call to
check_disk_change() under zvol_state_lock to make sure the
zvol_media_changed() and zvol_revalidate_disk() called by
check_disk_change() are invoked with positive zv_open_count.

Skip opened zvols when removing minors and set private_data
to NULL for zvols that are not in use whose minors are being
removed, to indicate to zvol_open() that the state is gone.
Skip opened zvols when renaming minors to avoid modifying
zv_name that might be in use, e.g. in zvol_ioctl().

Drop zvol_state_lock before calling add_disk() when creating
minors to avoid deadlocks with zvol_open().

Wrap dmu_objset_find() with spl_fstran_mark()/unmark().

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4344
2016-02-24 11:54:24 -08:00
Richard Yao
19a47cb1c2 Call dmu_read_uio_dbuf() in zvol_read()
The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is
that the former takes a hold while the latter uses an existing hold.
`zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while
our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no
apparent reason, we inherited a `zvol_read()` function from
OpenSolaris that does `dmu_read_uio()`. illumos-gate also still
uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`,
which is more performant.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4316
2016-02-18 10:24:33 -08:00
Richard Yao
a765a34a31 Clean up zvol request processing to pass uio and fix porting regressions
In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t
rather than bio_t. Since we are translating from bio to uio for both, we
might as well unify the logic and have code more similar to its illumos
counterpart. At the same time, we can fix some regressions that occurred
versus the original code from illumos-gate.

We refactor zvol_write to take uio and also correct the
following problems:

1. We did `dnode_hold()` on each IO when we already had a hold.
2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the
DMU.
3. We could call `zil_commit()` twice. In this case, this is because
Linux uses the `->write` function to send flushes and can aggregate the
flush with a write. If a synchronous write occurred with the flush, we
effectively flushed twice when there is no need to do that.

zvol_read also suffers from the first two problems. Other platforms
suffer from the first, so we leave that for a second patch so that there
is a discrete patch for them to cherry-pick.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4316
2016-02-18 10:23:30 -08:00
Chunwei Chen
093911f194 Remove wrong ASSERT in annotate_ecksum
When using large blocks like 1M, there will be more than UINT16_MAX qwords in
one block, so this ASSERT would go off. Also, it is possible for the histogram
to overflow. We cap them to UINT16_MAX to prevent this.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4257
2016-02-17 10:43:02 -08:00
Richard Yao
0b43696e66 random_get_pseudo_bytes() need not provide cryptographic strength entropy
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2.  Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`).  In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372
2016-02-17 09:49:09 -08:00
Paul Dagnelie
6b42ea8590 Illumos 5809 - Blowaway full receive in v1 pool causes kernel panic
5809 Blowaway full receive in v1 pool causes kernel panic
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5809
  https://github.com/illumos/illumos-gate/commit/f40b29c

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-02-08 09:37:55 -08:00
Chunwei Chen
8f3b403a73 Allow kicking a taskq to spawn more threads
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529
2016-02-05 14:08:31 -08:00
Gary Mills
7c9abfa7ab Illumos 6537 - Panic on zpool scrub with DEBUG kernel
6537 Panic on zpool scrub with DEBUG kernel
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/6537
  https://github.com/illumos/illumos-gate/commit/8c04a1f

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-02-05 11:29:32 -08:00
Dan McDonald
a4d179efa9 Illumos 6096 - ZFS_SMB_ACL_RENAME needs to cleanup better
6096 ZFS_SMB_ACL_RENAME needs to cleanup better
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6096
  https://github.com/illumos/illumos-gate/commit/8f5190a5

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-02-05 11:28:53 -08:00
Matthew Ahrens
b77222c873 Illumos 6450 - scrub/resilver unnecessarily traverses snapshots
6450 scrub/resilver unnecessarily traverses snapshots created
after the scrub started
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6450
  https://github.com/illumos/illumos-gate/commit/38d6103

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-02-02 14:00:24 -08:00
Richard Sharpe
9d36cdb650 Handling negative dentries in a CI file system.
For a Case Insensitive file system we must avoid creating negative
entries in the dentry cache. We must also pass the FIGNORECASE into
zfs_lookup so that special files are handled correctly.

We must also prevent negative dentries from being created when files are
unlinked.

Tested by running fsstress from LTP (10 loops, 10 processes, 10,000 ops.)

Also tested with printks (now removed) to ensure that lookups come to
zpl_lookup when negative should not exist.

Tests:
1.   ls Some-file.txt; touch some-file.txt; ls Some-file.txt
  and ensure no errors.

2.   touch Some-file.txt; rm some-file.txt; ls Some-file.txt
  and ensure that the last ls shows log messages showing the lookup
  went all the way to zpl_lookup.

Thanks to tuxoko for helping me get this correct.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4243
2016-02-02 13:59:00 -08:00
Jorgen Lundman
4b9ed698b4 Illumos 6527 - Possible access beyond end of string in zpool comment
6527 Possible access beyond end of string in zpool comment
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/6527
  https://github.com/illumos/illumos-gate/commit/2bd7a8d

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
2016-01-28 12:46:04 -05:00
Brian Behlendorf
e56766360b Illumos 6495 - Fix mutex leak in dmu_objset_find_dp
6495 Fix mutex leak in dmu_objset_find_dp
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Albert Lee <trisk@omniti.com>

References:
  https://www.illumos.org/issues/6495
  https://github.com/illumos/illumos-gate/commit/2bad225

Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
2016-01-28 12:45:39 -05:00
Brian Behlendorf
b6fcb792ca Illumos 6414 - vdev_config_sync could be simpler
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6414
  https://github.com/illumos/illumos-gate/commit/eb5bb58

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
2016-01-28 12:44:39 -05:00
Simon Klinkert
1a04bab348 llumos 6334 - Cannot unlink files when over quota
6334 Cannot unlink files when over quota
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6334
  https://github.com/illumos/illumos-gate/commit/6575bca

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-26 15:27:08 -08:00
kernelOfTruth
a966c5640e Reintroduce zfs_remove() synchronous deletes
Reintroduce a slightly adapted version of the Illumos logic for
synchronous unlinks.  The basic idea here is that only files
smaller than zfs_delete_blocks (20480) blocks should be deleted
synchronously.  Unlinking larger files should be handled
asynchronously to minimize impact to the caller.

To accomplish this iput() which is responsible for calling
zfs_znode_delete() on Linux is only called in the delete_now
path.  Otherwise zfs_async_iput() is used which allows the
last reference to be dropped by a taskq thread effectively
making the removal asynchronous.

Porting notes:
- Add zfs_delete_blocks module option for performance analysis.
  The default value is DMU_MAX_DELETEBLKCNT which is the same
  as upstream.  Reducing this value means that smaller files
  will be unlinked asynchronously like large files.
- All occurrences of zfsvfs changes to zsb.

Ported-by: KernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-26 15:26:02 -08:00
Dan McDonald
460a021391 Log zvol truncate/discard operations
As the comments in zvol_discard() suggested, the discard operation
could be logged to the zil.  This is a port of the relevant code from
Nexenta as it was added in "701 UNMAP support for COMSTAR" and has been
attributed to the author of that commit.

References:
  https://github.com/Nexenta/illumos-nexenta/commit/b77b923
  https://github.com/zfsonlinux/zfs/blob/089fa91b/module/zfs/zvol.c#L637

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-26 14:16:03 -08:00
George Wilson
ba5ad9a48d Illumos 6251 - add tunable to disable free_bpobj processing
6251 - add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/6251
  https://github.com/illumos/illumos-gate/commit/139510f

Porting notes:
- Added as module option declaration.
- Added to zfs-module-parameters.5 man page.

Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-25 13:15:17 -08:00
Tim Chase
0a1f8cd999 Set arc_c_min properly in userland builds
Since it's set to arc_c_max / 2, it must be set after arc_c_max is set.
Also added protection against it falling below 2 * maxblocksize in
userland builds.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4268
2016-01-25 10:25:10 -08:00
Tim Chase
1b8951b319 Prevent arc_c collapse
Adjusting arc_c directly is racy because it can happen in the context
of multiple threads.  It should always be >= 2 * maxblocksize.  Set it
to a known valid value rather than adjusting it directly.

In addition refactor arc_shrink() to a simpler structure, protect against
underflow in the calculation of the new arc_c value.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reverts: 935434ef
Closes: #3904
Closes: #4161
2016-01-25 10:25:04 -08:00
Brian Behlendorf
6b38e7510f Remove RLIM64_INFINITY assert in vn_rdwr()
Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY.  The vn_rdwr() function
needs to be updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513
2016-01-23 11:16:23 -08:00
Richard Yao
be29e6a6e6 kobj_read_file: Return -1 on vn_rdwr() error
I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.

Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513
2016-01-23 10:10:44 -08:00
Matthew Ahrens
19d55079ae Illumos 4950 - files sometimes can't be removed from a full filesystem
4950 files sometimes can't be removed from a full filesystem
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4950
  https://github.com/illumos/illumos-gate/commit/4bb7380

Porting notes:
- ZoL currently does not log discards to zvols, so the portion of
  this patch that modifies the discard logging to mark it as
  freeing space has been discarded.

2. may_delete_now had been removed from zfs_remove() in ZoL.
   It has been reintroduced.

3. We do not try to emulate vnodes, so the following lines are
   not valid on Linux:

	mutex_enter(&vp->v_lock);
	may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp);
	mutex_exit(&vp->v_lock);

  This has been replaced with:

	mutex_enter(&zp->z_lock);
	may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped);
	mutex_exit(&zp->z_lock);

Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-21 16:59:30 -08:00
Brian Behlendorf
37c56346cc Close possible zfs_znode_held() race
Check if the lock is held while holding the z_hold_locks() lock.
This prevents a possible use-after-free bug for callers which are
not holding the lock.  There currently are no such callers so this
can't cause a problem today but it has been fixed regardless.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4244
Issue #4124
2016-01-20 13:36:15 -08:00
Chunwei Chen
16522ac290 Use tsd to store tq for taskq_member
To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #500
Closes #504
Closes #505
2016-01-20 13:07:45 -08:00
Brian Behlendorf
4967a3eb9d Linux 4.5 compat: xattr list handler
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check.  Given a dentry for the file it
must return a boolean indicating if the name is visible.  This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size.  That is now all the responsibility of the caller.

This should be straight forward change to make to ZoL since we've
always required the caller to make the copy.  However, this was
slightly complicated by the need to support 3 older APIs.  Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!

Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable.  These changes include:

- Improved configure checks for .list, .get, and .set interfaces.
  - Interfaces checked from newest to oldest.
  - Strict checking for each possible known interface.
  - Configure fails when no known interface is available.
  - HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
    with similar iops and fops configure checks.

- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
  move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
  Compatibility wrapper were added for old kernels.

- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
  ZPL_XATTR_{GET|SET} WRAPPERs.  Only the inode is guaranteed to be
  a valid pointer, passing NULL for the 'list' and 'name' variables
  is allowed and must be checked for.  All .list functions were
  updated to use the wrapper to aid readability.

- zpl_xattr_filldir() updated to use the .list function for its
  permission check which is consistent with the updated Linux 4.5
  interface.  If a .list function is registered it should return 0
  to indicate a name should be skipped, if there is no registered
  function the name will be added.

- Additional documentation from xattr(7) describing the correct
  behavior for each namespace was added before the relevant handlers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
2016-01-20 11:36:56 -08:00
Brian Behlendorf
beeed4596b Linux 4.5 compat: get_link() / put_link()
The follow_link() interface was retired in favor of get_link().
In the process of phasing in get_link() the Linux kernel went
through two different versions.  The first of which depended
on put_link() and the final version on a delayed done function.

- Improved configure checks for .follow_link, .get_link, .put_link.
  - Interfaces checked from newest to oldest.
  - Strict checking for each possible known interface.
  - Configure fails when no known interface is available.

- Both versions .get_link are detected and supported as well
  two previous versions of .follow_link.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
2016-01-20 11:36:00 -08:00
Josef 'Jeff' Sipek
bc89ac8479 Illumos 5045 - use atomic_{inc,dec}_* instead of atomic_add_*
5045 use atomic_{inc,dec}_* instead of atomic_add_*
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5045
  https://github.com/illumos/illumos-gate/commit/1a5e258

Porting notes:
- All changes to non-ZFS files dropped.
- Changes to zfs_vfsops.c dropped because they were Illumos specific.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4220
2016-01-15 15:38:36 -08:00
Marcel Telka
812e91a7e3 Illumos 4039 - zfs_rename()/zfs_link() needs stronger test for XDEV
4039 zfs_rename()/zfs_link() needs stronger test for XDEV
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4039
  https://github.com/illumos/illumos-gate/commit/18e6497

Porting notes:
- This check was updated in Linux in a similar fashion early on in
  the port.  Therefore, this patch just reorders the function and
  updates the comment so it flows the same way as the upstream code.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4218
2016-01-15 15:38:35 -08:00
George Wilson
59d4c71cca Illumos 3557, 3558, 3559, 3560
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Albert Lee <trisk@nexenta.com>

References:
  https://www.illumos.org/issues/3559
  https://github.com/illumos/illumos-gate/commit/c61ea56

Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
  The external interface and behavior was already consistent with the
  latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check.  All
  supported kernels (2.6.32 and newer) provide this interface.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4217
2016-01-15 15:38:35 -08:00
Chunwei Chen
21f604d460 Prevent duplicated xattr between SA and dir
When replacing an xattr would cause overflowing in SA, we would fallback
to xattr dir. However, current implementation don't clear the one in SA,
so we would end up with duplicated SA.

For example, running the following script on an xattr=sa filesystem
would cause duplicated "user.1".

-- dup_xattr.sh begin --
randbase64()
{
        dd if=/dev/urandom bs=1 count=$1 2>/dev/null | openssl enc -a -A
}

file=$1
touch $file
setfattr -h -n user.1 -v `randbase64 5000` $file
setfattr -h -n user.2 -v `randbase64 20000` $file
setfattr -h -n user.3 -v `randbase64 20000` $file
setfattr -h -n user.1 -v `randbase64 20000` $file
getfattr -m. -d $file
-- dup_xattr.sh end --

Also, when a filesystem is switch from xattr=sa to xattr=on, it will
never modify those in SA. This would cause strange behavior like, you
cannot delete an xattr, or setxattr would cause duplicate and the result
would not match when you getxattr.

For example, the following shell sequence.

-- shell begin --
$ sudo zfs set xattr=sa pp/fs0
$ touch zzz
$ setfattr -n user.test -v asdf zzz
$ sudo zfs set xattr=on pp/fs0
$ setfattr -x user.test zzz
setfattr: zzz: No such attribute
$ getfattr -d zzz
user.test="asdf"
$ setfattr -n user.test -v zxcv zzz
$ getfattr -d zzz
user.test="asdf"
user.test="asdf"
-- shell end --

We fix this behavior, by first finding where the xattr resides before
setxattr. Then, after we successfully updated the xattr in one location,
we will clear the other location. Note that, because update and clear
are not in single tx, we could still end up with duplicated xattr. But
by doing setxattr again, it can be fixed.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3472
Closes #4153
2016-01-15 15:38:35 -08:00
Richard Yao
b10695c8f1 Remove fastwrite mutex
The fast write mutex is intended to protect accounting, but it is
redundant because all accounting is performed through atomic operations.
It also serializes all metaslab IO behind a mutex, which introduces a
theoretical scaling regression that the Illumos developers did not like
when we showed this to them. Removing it makes the selection of the
metaslab_group lock free as it is on Illumos. The selection is not quite
the same without the lock because the loop races with IO completions,
but any imbalances caused by this are likely to be corrected by
subsequent metaslab group selections.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3643
2016-01-15 15:38:35 -08:00
Brian Behlendorf
c96c36fa22 Fix zsb->z_hold_mtx deadlock
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4106
2016-01-15 15:33:45 -08:00
Brian Behlendorf
0720116d4d Add zfs_object_mutex_size module option
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #4106
2016-01-15 15:33:44 -08:00
Brian Behlendorf
89666a8e1c Increase default user space stack size
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest.  This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler.  Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.

$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...

loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4215
2016-01-13 13:55:12 -08:00
Chunwei Chen
e843553d03 Don't hold mutex until release cv in cv_wait
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106
2016-01-12 15:18:44 -08:00
Will Andrews
e6cfd633be Illumos 3749 - zfs event processing should work on R/O root filesystems
3749 zfs event processing should work on R/O root filesystems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3749
  https://github.com/illumos/illumos-gate/commit/3cb69f7

Porting notes:
- [include/sys/spa_impl.h]
  - ffe9d38 Add generic errata infrastructure
  - 1421c89 Add visibility in to arc_read
- [include/sys/fm/fs/zfs.h]
  - 2668527 Add linux events
  - 6283f55 Support custom build directories and move includes
- [module/zfs/spa_config.c]
  - Updated spa_config_sync() to match illumos with the exception
    of a Linux specific block.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 14:42:32 -08:00
Justin Gibbs
ee3a23b84e Illumos 5438 - zfs_blkptr_verify should continue after zfs_panic_recover
5438 zfs_blkptr_verify should continue after zfs_panic_recover
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5438
  https://github.com/illumos/illumos-gate/commit/5897eb4

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 13:54:05 -08:00
Josef 'Jeff' Sipek
fc581e0507 Illumos 5515 - dataset user hold doesn't reject empty tags
5515 dataset user hold doesn't reject empty tags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/5515
  https://github.com/illumos/illumos-gate/commit/752fd8d

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 13:52:26 -08:00
George Wilson
a6fb32b85a Illumos 6281 - prefetching should apply to 1MB reads
6281 prefetching should apply to 1MB reads
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Alexander Motin <mav@freebsd.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/6281
  https://github.com/illumos/illumos-gate/commit/6328027

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 13:51:27 -08:00
Saso Kiselkov
adfe9d932b Illumos 6367 - spa_config_tryenter incorrectly handles the multiple-lock case
6367 spa_config_tryenter incorrectly handles the multiple-lock case
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/6367
  https://github.com/illumos/illumos-gate/commit/e495b6e

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 11:05:28 -08:00
Joe Stein
5f3d9c69d1 Illumos 6295 - metaslab_condense's dbgmsg should include vdev id
6295 metaslab_condense's dbgmsg should include vdev id
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6295
  https://github.com/illumos/illumos-gate/commit/daec38e

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 11:02:07 -08:00
Justin T. Gibbs
0eb21616fa Illumos 6171 - dsl_prop_unregister() slows down dataset eviction.
6171 dsl_prop_unregister() slows down dataset eviction.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6171
  https://github.com/illumos/illumos-gate/commit/03bad06

Porting notes:
  - Conflicts
    - 3558fd7 Prototype/structure update for Linux
    - 2cf7f52 Linux compat 2.6.39: mount_nodev()
    - 13fe019 Illumos #3464
    - 241b541 Illumos 5959 - clean up per-dataset feature count code
  - dsl_prop_unregister() preserved until out of tree consumers
    like Lustre can transition to dsl_prop_unregister_all().
  - Fixing 'space or tab at end of line' in include/sys/dsl_dataset.h

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 10:53:12 -08:00
Matthew Ahrens
5a28a9737a Illumos 6288 - dmu_buf_will_dirty could be faster
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6288
  https://github.com/illumos/illumos-gate/commit/0f2e7d0

Porting notes:
- [module/zfs/dbuf.c]
  - Fix 'warning: ISO C90 forbids mixed declarations and code'
    by moving 'dbuf_dirty_record_t *dr' to start of code block.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:13:52 -08:00
George Wilson
2e8efe1bef Illumos 6292 - exporting a pool while an async destroy
6292 exporting a pool while an async destroy is running can leave
entries in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/6292
  https://github.com/illumos/illumos-gate/commit/a443cc8

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:10:52 -08:00
Matthew Ahrens
5511754b4f Illumos 6319 - assertion failed in zio_ddt_write: bp->blk_birth == txg
6319 assertion failed in zio_ddt_write: bp->blk_birth == txg
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6319
  https://github.com/illumos/illumos-gate/commit/b39b744

Porting notes:
- Re-enabled ztest for CentOS test slaves.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3449
2016-01-12 09:10:52 -08:00
Matthew Ahrens
7f60329a26 Illumos 5987 - zfs prefetch code needs work
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/5987 zfs prefetch code needs work
  illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work

Porting notes:
- [module/zfs/dbuf.c]
  - 5f6d0b6 Handle block pointers with a corrupt logical size
- [module/zfs/dmu_zfetch.c]
  - c65aa5b Fix gcc missing parenthesis warnings
  - 428870f Update core ZFS code from build 121 to build 141.
  - 79c76d5 Change KM_PUSHPAGE -> KM_SLEEP
  - b8d06fc Switch KM_SLEEP to KM_PUSHPAGE
  - Account for ISO C90 - mixed declarations and code - warnings
  - Module parameters (new/changed):
    - Replaced zfetch_block_cap with zfetch_max_distance
      (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024))
    - Preserved zfs_prefetch_disable as 'int' for consistency with
      existing Linux module options.
- [include/sys/trace_arc.h]
  - Added new tracepoints
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async);
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch);
- [man/man5/zfs-module-parameters.5]
  - Updated man page

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:02:33 -08:00
Brian Behlendorf
ab5cbbd107 Illumos 6293 - ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6293
  https://github.com/illumos/illumos-gate/commit/8fe00bf

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 14:10:31 -08:00
Brian Behlendorf
b870c7e5f4 Revert "Illumos 3749 - zfs event processing should work on R/O root filesystems"
This reverts commit b47637ecdc which
introduced a regression in ztest.

$ ./cmd/ztest/ztest -V
5 vdevs, 7 datasets, 23 threads, 300 seconds...
*** Error in `/rpool/home/behlendo/src/git/zfs/cmd/ztest/.libs/lt-ztest':
double free or corruption (fasttop): 0x0000000000d339f0 ***
2016-01-11 14:10:30 -08:00
Brian Behlendorf
b858767a31 Fix 'prevsnap property' build failure
Fix build failure accidentally introduced by 1715493.  This only
results in a failure when debugging is disabled.

dsl_dataset.c: In function 'dsl_dataset_stats':
dsl_dataset.c:1698:45: error: 'dp' undeclared (first use in this function)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 13:35:40 -08:00
Matthew Ahrens
1715493f38 Illumos 4929 - want prevsnap property
4929 want prevsnap property
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4929
  https://github.com/illumos/illumos-gate/commit/b461c74

Porting notes:
- [include/sys/fs/zfs.h]
  - f67d70 Create an 'overlay' property
  - 11b9ec Add full SELinux support
- [fs/zfs/dsl_dataset.c]
  - This increases the stack size of dsl_dataset_stats() but
    nothing has been changed until this is shown to be an issue.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 11:58:26 -08:00
Marcel Telka
f3c9dca093 Illumos 4638 - Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4638
  https://github.com/illumos/illumos-gate/commit/2144b12

Porting notes:
- [module/zfs/zfs_vnops.c]
  - 3558fd7 Prototype/structure update for Linux
  - 2cf7f52 Linux compat 2.6.39: mount_nodev()
  - Use zfs_is_readonly() wrapper
  - Remove first line of comment which doesn't apply

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 10:29:48 -08:00
Will Andrews
b47637ecdc Illumos 3749 - zfs event processing should work on R/O root filesystems
3749 zfs event processing should work on R/O root filesystems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3749
  https://github.com/illumos/illumos-gate/commit/3cb69f7

Porting notes:
- [include/sys/spa_impl.h]
  - ffe9d38 Add generic errata infrastructure
  - 1421c89 Add visibility in to arc_read
- [include/sys/fm/fs/zfs.h]
  - 2668527 Add linux events
  - 6283f55 Support custom build directories and move includes
- [module/zfs/spa_config.c]
  - Updated spa_config_sync() to match illumos with the exception
    of a Linux specific block.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 09:23:37 -08:00
Brian Behlendorf
e9e3d31d2c Allow 16M send/recv blocks
Fix an off by one error introduced by fcff0f3 which triggers an
assertion when 16M blocks are used with send/recv.  This fix was
intentionally not folder in to the Illumos commit so it can be
easily cherry-picked by upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-08 20:23:23 -05:00
Paul Dagnelie
fcff0f35bd Illumos 5960, 5925
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/5960
  https://www.illumos.org/issues/5925
  https://github.com/illumos/illumos-gate/commit/a2cdcdd

Porting notes:
- [lib/libzfs/libzfs_sendrecv.c]
  - b8864a2 Fix gcc cast warnings
  - 325f023 Add linux kernel device support
  - 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
- [module/zfs/zfs_vnops.c]
  - 3558fd7 Prototype/structure update for Linux
  - c12e3a5 Restructure zfs_readdir() to fix regressions
- [module/zfs/zvol.c]
  - Function @zvol_map_block() isn't needed in ZoL
  - 9965059 Prefetch start and end of volumes
- [module/zfs/dmu.c]
  - Fixed ISO C90 - mixed declarations and code
  - Function dmu_prefetch() 'int i' is initialized before
    the following code block (c90 vs. c99)
- [module/zfs/dbuf.c]
  - fc5bb51 Fix stack dbuf_hold_impl()
  - 9b67f60 Illumos 4757, 4913
  - 34229a2 Reduce stack usage for recursive traverse_visitbp()
- [module/zfs/dmu_send.c]
  - Fixed ISO C90 - mixed declarations and code
  - b58986e Use large stacks when available
  - 241b541 Illumos 5959 - clean up per-dataset feature count code
  - 77aef6f Use vmem_alloc() for nvlists
  - 00b4602 Add linux kernel memory support

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-08 15:08:19 -08:00
Richard Sharpe
c5d0287011 Fix casesensitivity=insensitive deadlock
When casesensitivity=insensitive is set for the
file system, we can deadlock in a rename if the user uses different case
for each path. For example rename("A/some-file.txt", "a/some-file.txt").

The simple test for this is:

1. mkdir some-dir in a ZFS file system
2. touch some-dir/some-file.txt
3. mv Some-dir/some-file.txt some-dir/some-other-file.txt

This last request deadlocks trying to relock the i_mutex on the inode for
the parent directory.

The solution is to use d_add_ci in zpl_lookup if we are on a file system
that has the casesensitivity=insensitive attribute set.

This patch checks if we are working on a case insensitive file system and if
so, allocates storage for the case insensitive name and passes it to
zfs_lookup and then calls d_add_ci instead of d_splice_alias.

The performance impact seems to be minimal even though we have introduced a
kmalloc and kfree in the lookup path.

The problem was found when running Microsoft's FSCT against Samba on top of
ZFS On Linux.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4136
2016-01-08 11:05:07 -08:00
Jeremy Jones
b23ad7f350 Illumos 3139 - zdb dies when it tries to determine path of unlinked file
3139 zdb dies when it tries to determine path of unlinked file
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://github.com/illumos/illumos-gate/commit/1ce39b5
  https://www.illumos.org/issues/3139

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-05 11:25:41 -08:00
Matthew Ahrens
37f8a8835a Illumos 5746 - more checksumming in zfs send
5746 more checksumming in zfs send
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>

References:
  https://www.illumos.org/issues/5746
  https://github.com/illumos/illumos-gate/commit/98110f0
  https://github.com/zfsonlinux/zfs/issues/905

Porting notes:
- Minor conflicts due to:
  - https://github.com/zfsonlinux/zfs/commit/2024041
  - https://github.com/zfsonlinux/zfs/commit/044baf0
  - https://github.com/zfsonlinux/zfs/commit/88904bb
- Fix ISO C90 warnings (-Werror=declaration-after-statement)
  - arc_buf_t *abuf;
  - dmu_buf_t *bonus;
  - zio_cksum_t cksum_orig;
  - zio_cksum_t *cksump;
- Fix format '%llx' format specifier warning
- Align message in zstreamdump safe_malloc() with upstream

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3611
2015-12-30 14:24:14 -08:00
Ned Bass
43b4935e53 Prevent SA length overflow
The function sa_update() accepts a 32-bit length parameter and
assigns it to a 16-bit field in sa_bulk_attr_t, potentially
truncating the passed-in value. This could lead to corrupt system
attribute (SA) records getting written to the pool. Add a VERIFY to
sa_update() to detect cases where overflow would occur. The SA length
is limited to 16-bit values by the on-disk format defined by
sa_hdr_phys_t.

The function zfs_sa_set_xattr() is vulnerable to this bug if the
unpacked nvlist of xattrs is less than 64k in size but the packed
size is greater than 64k. Fix this by appropriately checking the
size of the packed nvlist before calling sa_update(). Add error
handling to zpl_xattr_set_sa() to keep the cached list of SA-based
xattrs consistent with the data on disk.

Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned
transaction if sa_update() returns an error, but the DMU only allows
unassigned transactions to be aborted. Wrap the sa_update() call in a
VERIFY0, remove the transaction abort, and call dmu_tx_commit()
unconditionally. This is consistent practice with other callers
of sa_update().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4150
2015-12-30 13:20:12 -08:00
Chunwei Chen
f5f087eb88 Make xattr dir truncate and remove in one tx
We need truncate and remove be in the same tx when doing zfs_rmnode on xattr
dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap
object on the delete queue. We do this by skipping dmu_free_long_range and let
zfs_znode_delete to do the work.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
2015-12-28 09:48:26 -08:00
Chunwei Chen
29572ccdef Fix empty xattr dir causing lockup
During zfs_rmnode on a xattr dir, if the system crash just after
dmu_free_long_range, we would get empty xattr dir in delete queue. This would
cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during
mount, and would try to do rw_enter on a wrong structure and cause system
lockup.

We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4114
Closes #4052
Closes #4006
Closes #3018
Closes #2861
2015-12-28 09:41:30 -08:00
Brian Behlendorf
2ebc7b72b3 Fix z_xattr_lock/z_teardown_lock inversion
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock.  Resolve this by taking the z_teardown_lock in
all registered xattr callbacks prior to taking the z_xattr_lock.
This ensures the locks are always taken is the same order thus
preventing a deadlock.  Note the z_teardown_lock is taken again
in zfs_lookup() and this is safe because the z_teardown lock is
a re-entrant read reader/writer lock.

* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
  __zpl_xattr_get
    zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro

* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
  zfs_resume_fs
    zfs_rezget -> Takes zp->z_xattr_lock

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3943
Closes #3969
Closes #4121
2015-12-22 16:59:20 -08:00
Brian Behlendorf
228b461b56 Revert "Fix z_xattr_lock/z_teardown_lock lock inversion"
This reverts commit 6b32ef572f754efc3f9edb20d022450f8e6b02d9.
2015-12-22 16:58:43 -08:00
Brian Behlendorf
151f84e2c3 Fix ztest truncated cache file
Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
truncate and overwrite rather than rename the cache file.  This is
the correct fix but it should have only been applied for the kernel
build.  In user space rename(2) is needed because ztest depends on
the cache file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4129
2015-12-22 10:40:40 -08:00
Brian Behlendorf
2a552736b7 Fix do_div() types in condvar:timeout
The do_div() macro expects unsigned types and this is detected in
powerpc implementation of do_div().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516
2015-12-22 10:32:35 -08:00
Olaf Faaland
448d7aaabc Identify locks flagged by lockdep
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.

This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark
such mutexes when they are initialized.  This mutex type causes
attempts to take or release those locks to be wrapped in lockdep_off()
and lockdep_on() calls to silence the dependency checker and allow the
use of lock_stats to examine contention.

For RW locks, it uses an analogous lock type, RW_NOLOCKDEP.

The goal is that these locks are ultimately changed back to type
MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect
their relationship (e.g. z_name_lock below) or any real problem with the
lock dependencies are fixed.

Some of the affected locks are:

tc_open_lock:
=============
This is an array of locks, all with same name, which txg_quiesce must
take all of in order to move txg to next state.  All default to the same
lockdep class, and so to lockdep appears recursive.

zp->z_name_lock:
================
In zfs_rmdir,
        dzp = znode for the directory (input to zfs_dirent_lock)
        zp  = znode for the entry being removed (output of zfs_dirent_lock)

zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp
zfs_rmdir() takes z_name_lock in zp

Since both dzp and zp are type znode_t, the locks have the same default
class, and lockdep considers it a possible recursive lock attempt.

l->l_rwlock:
============
zap_expand_leaf() sometimes creates two new zap leaf structures, via
these call paths:

zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open()
zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf()

Because both zap_leaf_open() and zap_create_leaf() initialize
l->l_rwlock in their (separate) leaf structures, the lockdep class is
the same, and the linux kernel believes these might both be the same
lock, and emits a possible recursive lock warning.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3895
2015-12-22 10:21:33 -08:00
DHE
dcb6bed1df Make zio_taskq_batch_pct user configurable
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4110
2015-12-18 13:46:23 -08:00
Brian Behlendorf
a58df6f536 Fix zfs_vdev_aggregation_limit bounds checking
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-18 13:32:06 -08:00
Brian Behlendorf
6fe53787f3 Fix vdev_queue_aggregate() deadlock
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3808
Closes #3867
2015-12-18 13:27:12 -08:00
Chunwei Chen
b4ad50ac5f Use spl_fstrans_mark instead of memalloc_noio_save
For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #515
Issue zfsonlinux/zfs#4111
2015-12-18 13:24:52 -08:00
Brian Behlendorf
a8ad3bf02c Fix z_xattr_lock/z_teardown_lock lock inversion
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock.  Detect this case and return EBUSY so zfs_resume_fs()
will mark the inode stale and it can be safely revalidated on next
access.

* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
  __zpl_xattr_get
    zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro

* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
  zfs_resume_fs
    zfs_rezget -> Takes zp->z_xattr_lock

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3969
2015-12-18 13:17:44 -08:00
Tim Chase
200366f23f Provide kstat for taskqs
This patch provides 2 new kstats to display task queues:

  /proc/spl/taskqs-all - Display all task queues
  /proc/spl/taskqs - Display only "active" task queues

A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.

If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.

If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.

If the task queue has any waiters, displays each waiting task's pid.

Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491
2015-12-16 09:35:22 -08:00
Chunwei Chen
2727b9d3b6 Use uio for zvol_{read,write}
Since uio now supports bvec, we can convert bio into uio and reuse
dmu_{read,write}_uio. This way, we can remove some duplicate code.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4078
2015-12-15 16:21:43 -08:00
Chunwei Chen
502923bb44 Fix uio_prefaultpages for 0 length iovec
Userspace can freely pass in whatever iovec it feels like, and it's perfectly
legal to pass an iovec which contains a zero length segment. In the current
implementation, uio_prefaultpages would touch an out of bound byte in the
"last byte" logic. While this probably wouldn't cause any critical error, we
would like uio_prefaultpages to be able to continue gracefully.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4078
2015-12-15 16:19:55 -08:00
Brian Behlendorf
eba9e745dc Handle damaged blk_birth in dsl_deadlist_insert()
If a bit were cleared in `bp->blk_birth` such that the txg birth
was now lower than any other txg_birth in the deadlist, then there
will be no entry before this in the tree.

This should be impossible but regardless error handling code has
been added for this case.  By default this is left as a fatal case
and the blk_birth is logged.  However, setting `zfs_recover=1` will
cause the bp to be placed at the start of the deadlist even though
it contains an invalid blk_birth.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4086
Closes #4089
2015-12-15 16:12:31 -08:00
Brian Behlendorf
1cdb86cba2 Handle block pointers with a corrupt logical size
Commit 5f6d0b6 was originally added to gracefully handle block
pointers with a damaged logical size.  However, it incorrectly
assumed that all passed arc_done_func_t could handle a NULL
arc_buf_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4069
Closes #4080
2015-12-15 16:11:44 -08:00
Brian Behlendorf
245b7ab3d1 Hold the zfs_snapentry_t before dispatch
While exceptionally unlikely to cause a problem the zfs_snapentry_t
hold should be taken before the dispatch to prevent any possibility
of the task being processed before the hold.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 12:06:31 -08:00
Chunwei Chen
1997660170 Fix snapshot automount race cause EREMOTE
When a concorrent mount finishes just before calling to
zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return
with EREMOTE. We should instead just return 0, so VFS may retry and
would likely notice the dentry is alreadly mounted. This will be
inline with when usermode helper return EBUSY.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-14 12:06:31 -08:00
Brian Behlendorf
5ed27c572c Change zfs_snapshot_lock from mutex to rw lock
By changing the zfs_snapshot_lock from a mutex to a rw lock the
zfsctl_lookup_objset() function can be allowed to run concurrently.
This should reduce the latency of fh_to_dentry lookups in ZFS
snapshots which are being accessed over NFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 12:06:31 -08:00
Brian Behlendorf
f22f900f15 Fix zfsctl_lookup_objset() deadlock
The zfsctl_snapshot_unmount_delay() function must not be called
from zfsctl_lookup_objset() while it is currently holding the
zfs_snapshot_lock.  This will result in a deadlock.  It is safe
to call zfsctl_snapshot_unmount_delay_impl() directly because the
function already has a reference on the zfs_snapentry_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3997
2015-12-14 12:05:52 -08:00
Brian Behlendorf
5e94284fe5 Set 'zfs_expire_snapshot=0' to disable auto-unmount
There are cases where it's desirable that auto-mounted snapshots
not expire after a fixed duration.  They should be unmounted only
when the filesystem they are a snapshot of is unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 11:02:32 -08:00
Brian Behlendorf
2c4332cf79 Fix cstyle issues in spl-taskq.c and taskq.h
This patch only addresses the issues identified by the style checker.
It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-11 16:20:22 -08:00
Chunwei Chen
066b89e685 Don't use tq->tq_lock_flags
The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #506
2015-12-11 16:20:03 -08:00
Olaf Faaland
326172d854 Subclass tq_lock to eliminate a lockdep warning
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking.  This is
a false positive.

One such call chain is as follows, when a taskq needs more threads:
	taskq_dispatch->taskq_thread_spawn->taskq_dispatch

The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads.  The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock.  Without subclassing, lockdep believes these
could potentially be the same lock and complains.  A similar case occurs
when taskq_dispatch() then calls task_alloc().

This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:

subclass              taskq
TQ_LOCK_DYNAMIC       dynamic_taskq
TQ_LOCK_GENERAL       any other

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
2015-12-11 16:19:56 -08:00
Brian Behlendorf
c5a8b1e163 Revert "Make taskq_member() use ->journal_info"
This reverts commit a430c11f0b.  Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
2015-12-08 17:12:36 -08:00
Chunwei Chen
24ef51f660 Use spa as key besides objsetid for snapentry
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3948
Issue #3786
Issue #3887
2015-12-08 16:38:56 -08:00
Richard Yao
a430c11f0b Make taskq_member() use ->journal_info
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame.  This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.

This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #500
2015-12-08 13:24:47 -08:00
Brian Behlendorf
b58986eebf Use large stacks when available
While stack size will vary by architecture it has historically defaulted to
8K on x86_64 systems.  However, as of Linux 3.15 the default thread stack
size was increased to 16K.  These kernels are now the default in most non-
enterprise distributions which means we no longer need to assume 8K stacks.

This patch takes advantage of that fact by appropriately reverting stack
conservation changes which were made to ensure stability.  Changes which
may have had a negative impact on performance for certain workloads.  This
also has the side effect of bringing the code slightly more in line with
upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4059
2015-12-07 12:20:43 -08:00
Matthew Ahrens
241b541574 Illumos 5959 - clean up per-dataset feature count code
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5959
  https://github.com/illumos/illumos-gate/commit/ca0cc39

Porting notes:

illumos code doesn't check for feature_get_refcount() returning
ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added
a check in https://github.com/zfsonlinux/zfs/commit/784652c
due to #3468. The check was reintroduced here.

Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3965
2015-12-04 14:20:20 -08:00
Brian Behlendorf
072484504f Add zap_prefetch() interface
Provide a generic interface to prefetch ZAP entries by name.  This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4061
2015-12-04 09:39:20 -08:00
Richard Yao
1683e75edc Fix race between getf() and areleasef()
If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.

We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #492
2015-12-03 15:44:47 -08:00
tuxoko
b0fe1adeb1 Prevent rm modules.* when make install
This was originally in fe0ed8f910, but somehow
was changed and not working anymore. And it will cause the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4027
2015-12-02 14:39:12 -08:00
tuxoko
d28c5c4f04 Prevent rm modules.* when make install
This was originally in e80cd06b8e, but somehow
was changed and not working anymore. And it will cause the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #501
2015-12-02 14:38:20 -08:00
Dimitri John Ledkov
9f456111ab spl-kmem-cache: include linux/prefetch.h for prefetchw()
This is needed for architectures that do not have a builtin prefetchw()

Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #502
2015-12-02 12:45:06 -08:00
Chunwei Chen
61d482f7cd Linux 4.4 compat: xattr operations takes xattr_handler
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-01 16:48:25 -08:00
Chunwei Chen
1a09371678 Linux 4.4 compat: make_request_fn returns blk_qc_t
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-01 16:48:08 -08:00
tuxoko
43518d92fd Fix zfs_dirty_data_max overflow on 32-bit
On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow,
causing it to be smaller than zfs_dirty_data_sync, and will cause txg being
delayed while no one write to disk. The end result is horrendous write speed.

On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it
can reach speed on par with 64-bit VM.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3973
2015-11-19 16:02:47 -08:00
tuxoko
d0c614ecf9 Fix null pointer in arc_kmem_reap_now on 32-bit
On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL.
So we need to avoid reaping them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
2015-11-19 16:01:47 -08:00
Chunwei Chen
d287880afd Fix snapshot automount behavior when concurrent or fail
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.

Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.

The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
2015-11-19 15:36:59 -08:00
Brian Behlendorf
3d8d245fb3 Follow 0/-E convention for module load errors
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned.  This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter.  This is what was causing the hard failures
in the automated testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-11-16 16:10:06 -08:00
Brian Behlendorf
e7b75d9b46 Limit maximum object size in kmem tests
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests.  Large values can result in out of memory errors
for systems with less the 512M of memory.  Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-11-16 15:02:24 -08:00
AndCycle
256fa983f4 Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit.  Otherwise, the arc_meta_limit
may be set the same as arc_max.

Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
2015-11-13 15:45:22 -08:00
loli10K
31f24932a4 Remove superfluous newline character
Remove superfluous `newline` character from spl_kmem_cache_magazine_size
module parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #499
2015-11-13 15:27:45 -08:00
tuxoko
f5f2b87df0 Fix taskq dynamic spawning
Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.

For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy

This will easily deadlock before this patch.

Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496
2015-11-13 15:02:55 -08:00
Chunwei Chen
3e7e6f34d0 Don't call kmem_cache_shrink from shrinker
Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes #487
2015-11-11 13:48:31 -08:00
Chunwei Chen
07d63f0cb9 Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().

Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.

We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
2015-10-13 15:57:17 -07:00
Chunwei Chen
aa159afb56 Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.

We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
2015-10-13 15:25:33 -07:00
Justin T. Gibbs
bc4501f75a Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass bass6@llnl.gov
Issue #3865
Issue #3443
2015-10-13 14:12:02 -07:00
Brian Behlendorf
9b13f65d28 Fix CPU hotplug
Allocate a kmem cache magazine for every possible CPU which might
be added to the system.  This ensures that when one of these CPUs
is enabled it can be safely used immediately.

For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint.  In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
2015-10-13 09:50:40 -07:00
Brian Behlendorf
935434ef01 Fix 'arc_c < arc_c_min' panic
Strictly enforce keeping 'arc_c >= arc_c_min'.  The ASSERTs are
left in place to catch this in a debug build but logic has been
added to gracefully handle in a production build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
2015-10-13 09:23:35 -07:00
Richard Yao
919efe93cb zfs_inode_update should not call dmu_object_size_from_db under spinlock
We should never block when holding a spin lock, but zfs_inode_update can
block in the critical section of a spin lock in zfs_inode_update:

zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:47:40 -07:00
Richard Yao
bc8ffb2d08 Remove obsolete zv_lock
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it.  Lets remove it as clean up.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:43:19 -07:00
Chunwei Chen
45838e3a41 Fix uioskip crash when skip to end
When doing uioskip to skip an iovec to the very end, the current loop
condition will falsely check pass the end of iovec. We fix this checking
uio_iovcnt first.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3806
Closes #3850
2015-09-29 10:06:58 -07:00
Brian Behlendorf
2ebe396046 Fix PAX Patch/Grsec SLAB_USERCOPY panic
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled.  When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.

Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.

Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
2015-09-28 09:18:29 -07:00
Richard Yao
b815ec32b3 Userspace can pass zero length segments via writev/readv
Userspace can trigger an assertion by passing a zero-length segment
when assertions are enabled:

[27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0)
[27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages()
[27961.614805] Call Trace:
[27961.614811]   dump_stack+0x45/0x57
[27961.614830]   spl_dumpstack+0x44/0x50 [spl]
[27961.614834]   spl_panic+0xbb/0x100 [spl]
[27961.614908]   uio_prefaultpages+0x134/0x140 [zcommon]
[27961.614930]   zfs_write+0x1fd/0xe80 [zfs]
[27961.615014]   zpl_write_common_iovec+0x7f/0x110 [zfs]
[27961.615035]   zpl_iter_write+0xa0/0xd0 [zfs]
[27961.615037]   do_iter_readv_writev+0x59/0x80
[27961.615063]   do_readv_writev+0x11b/0x260
[27961.615098]   vfs_writev+0x39/0x50
[27961.615100]   SyS_writev+0x4a/0xe0
[27961.615103]   system_call_fastpath+0x16/0x6e

The solution is to delete the assertion. This could potentially
occur in uiomove as well, which contains analogous assertions
that appear similarly unnecessary, so we remove those as well.

Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3792
2015-09-25 12:51:16 -07:00
Brian Behlendorf
a3000f9358 Revert "dmu_objset_userquota_get_ids uses dn_bonus unsafely"
This reverts commit 5f8e1e8505.  It
was determined that this patch introduced the quota regression
described in #3789.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
Issue #3789
2015-09-25 12:50:24 -07:00
Brian Behlendorf
5592404784 Fix synchronous behavior in __vdev_disk_physio()
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
2015-09-25 12:47:31 -07:00
Brian Behlendorf
ef5b2e1048 Avoid blocking in arc_reclaim_thread()
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking.  Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf().  This will deadlock, see
issue #3822 for full backtraces showing the problem.

To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function.  This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer.  Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
2015-09-25 12:45:47 -07:00
Brian Behlendorf
04870568e6 Disable zpl_nr_cached_objects() callback
The zpl_nr_cached_objects() function has been disabled because in the
current code it doesn't provide any critical functionality and it may
result in a deadlock under certain circumstances.  However, because
we expect to need these hooks in the future this code has not been
entirely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3719
2015-09-25 12:45:42 -07:00
Brian Behlendorf
d4787d55ad Allow NFS activity to defer snapshot unmounts
Accessing a snapshot via NFS should cause an auto-unmount of that
snapshot to be deferred until such as time as the snapshot is idle.
This is analogous to the zpl_revalidate logic employed by locally
mounted snapshots.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3794
2015-09-25 12:45:38 -07:00
Lukas Wunner
784a7fe5d9 Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.

bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
2015-09-25 12:44:54 -07:00
Don Brady
56b3986316 Add large block support to zpios(1) benchmark
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.

  `zpios ... -S size | --blocksize size ...`

This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`

Example run below
(non realistic results from a VM and output abbreviated for space)

```
 --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
 --threaddelay=0 --cleanup --human-readable --verbose --cleanup
 --blocksize=128K,256K,512K,1M

 th-cnt  rg-cnt  rg-sz  ch-sz  blksz  wr-data wr-bw   rd-data rd-bw
---------------------------------------------------------------------
 4       750     8m     1m     128k   5g      90.06m  5g      93.37m
 4       750     8m     1m     256k   5g      79.71m  5g      99.81m
 4       750     8m     1m     512k   5g      42.20m  5g      93.14m
 4       750     8m     1m     1m     5g      35.51m  5g      89.36m
 8       750     8m     1m     128k   5g      85.49m  5g      90.81m
 8       750     8m     1m     256k   5g      61.42m  5g      99.24m
 8       750     8m     1m     512k   5g      49.09m  5g     108.78m
 16      750     8m     1m     128k   5g      86.28m  5g      88.73m
 16      750     8m     1m     256k   5g      64.34m  5g      93.47m
 16      750     8m     1m     512k   5g      68.84m  5g     124.47m
 16      750     8m     1m     1m     5g      53.97m  5g      97.20m
---------------------------------------------------------------------
```

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3795
Closes #2071
2015-09-22 09:13:20 -07:00
Ned Bass
3af56fd95f Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:

* Libzfs has historically included "xattr" in its list of default mount
  options. This overrides the dataset property, so the dataset is always
  configured to use directory-based xattrs even when the xattr dataset
  property is set to off or sa. Address this by removing "xattr" from
  the set of default mount options in libzfs.

* There was no way to enable system attribute-based extended attributes
  using temporary mount options. Add the mount options "saxattr" and
  "dirxattr" which enable the xattr behavior their names suggest.  This
  approach has the advantages of mirroring the valid xattr dataset
  property values and following existing conventions for mount option
  names.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
2015-09-19 14:04:14 -07:00
Brian Behlendorf
66aad10ce8 Fix NULL as mount(2) syscall data parameter
Passing NULL for the mount data should not result in EINVAL.  It
should be treated as if an empty string were passed and succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3771
2015-09-19 14:03:01 -07:00
Richard Yao
f52ebcb3eb Discard on zvols should not exceed the length of a block
37f9dac592 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.

Reported-by: Richard Connon <richard.connon@zynstra.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3798
2015-09-19 14:00:14 -07:00
Arne Jansen
4e0f33ffe0 Illumos 6214 - zpools going south
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>

References:
  https://www.illumos.org/issues/6214
  http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/

Porting Notes:

Reintroduce b_compress to the l2arc_buf_hdr_t.  In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t.  This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held.  See Illumos 6214
for a detailed analysis of the race.

HDR_GET_COMPRESS() macro was removed from arc_buf_info().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3757
2015-09-11 11:14:38 -07:00
Brian Behlendorf
9965059ab9 Prefetch start and end of volumes
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume.  Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
2015-09-09 14:38:29 -07:00
Richard Yao
d4bf6d8429 Disable direct reclaim in taskq worker threads on Linux 3.9+
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes #474
2015-09-09 14:30:47 -07:00
Richard Yao
8198d18ca7 Reintroduce IO accounting on zvols on Linux 3.19+
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later.  We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3741
2015-09-09 09:29:24 -07:00
Brian Behlendorf
3b36f8319d Add dbgmsg kstat
Internally ZFS keeps a small log to facilitate debugging.  By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1.  The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.

$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp    message
1441141408   spa=tank async request task=1
1441141408   txg 70 open pool version 5000; software version 5000/5; ...
1441141409   spa=tank async request task=32
1441141409   txg 72 import pool version 5000; software version 5000/5; ...
1441141414   command: lt-zpool import tank

Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log.  As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat.  For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.

$ ZFS_DEBUG=on ./cmd/ztest/ztest -V

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
2015-09-04 16:08:14 -07:00
Brian Behlendorf
0500e835af Support accessing .zfs/snapshot via NFS
This patch is based on the previous work done by @andrey-ve and
@yshui.  It triggers the automount by using kern_path() to traverse
to the known snapshout mount point.  Once the snapshot is mounted
NFS can access the contents of the snapshot.

Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server.  To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally.  System administators desiring this
functionally must explicitly enable it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2797
Closes #1655
Closes #616
2015-09-04 13:23:53 -07:00