Commit Graph

1308 Commits

Author SHA1 Message Date
Ryan Moeller
a631283b74 Move zfsdev_state_{init,destroy} to common code
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11833
2021-04-08 21:17:43 -07:00
Brian Behlendorf
600a1dc54c
Use dsl_scan_setup_check() to setup a scrub
When a rebuild completes it will automatically schedule a follow up
scrub to verify all of the block checksums.  Before setting up the
scrub execute the counterpart dsl_scan_setup_check() function to
confirm the scrub can be started.  Prior to this change we'd only
check vdev_rebuild_active() which isn't as comprehensive, and using
the check function keeps all of this logic in one place.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11849
2021-04-08 14:33:15 -07:00
Ryan Moeller
e778b0485b
Ratelimit deadman zevents as with delay zevents
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11786
2021-04-07 16:23:57 -07:00
Andrea Gelmini
bf169e9f15 Fix various typos
Correct an assortment of typos throughout the code base.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11774
2021-04-02 18:52:15 -07:00
Matthew Ahrens
2b56a63457
Use a helper function to clarify gang block size
For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the
gang DVA including its children BP's.  The space allocated at each DVA's
vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

This commit makes this relationship more clear by using a helper
function, `vdev_gang_header_asize()`, for the space allocated at the
gang block's vdev/offset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11744
2021-03-26 11:19:35 -07:00
Andrea Gelmini
8a915ba1f6
Removed duplicated includes
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11775
2021-03-22 12:34:58 -07:00
Alexander Motin
891568c990
Split dmu_zfetch() speculation and execution parts
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random.  But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run().  The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued.  It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.

For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients.  Benefits of those ways
varied, but neither was perfect.  With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.

While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock.  Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.

Delete prefetch streams when they reach ends of files.  It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache.  First cache miss immediately fires all the prefetch
that would be done for the stream by that time.  It saves some CPU
time if same files within DMU cache capacity are read over and over.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11652
2021-03-19 22:56:11 -07:00
Chunwei Chen
296a4a369b
Fix zfs_get_data access to files with wrong generation
If TX_WRITE is create on a file, and the file is later deleted and a new
directory is created on the same object id, it is possible that when
zil_commit happens, zfs_get_data will be called on the new directory.
This may result in panic as it tries to do range lock.

This patch fixes this issue by record the generation number during
zfs_log_write, so zfs_get_data can check if the object is valid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #10593
Closes #11682
2021-03-19 22:53:31 -07:00
Matthew Ahrens
330c6c0523
Clean up RAIDZ/DRAID ereport code
The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs.  Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not.  In
other words, these are instances of silent data corruption.  The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).

Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.

RAIDZ/DRAID generate checksum errors via three code paths:

1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction.  This is handled in `raidz_reconstruct()`.  This is the
most common type of RAIDZ/DRAID checksum error.

2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost.  The zio fails and an error is
returned to the consumer (e.g. the read(2) system call).  This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged.  Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read.  (For DRAID,
typically only one "group" of children is involved in each io.)  This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.

3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happens to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child.  This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity.  This case
happens very rarely, since normally only metadata has ditto blocks.  If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.

The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:

1. It needs to run after the main raidz read logic has completed.  The
data RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.

2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data.  For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.

Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3.  These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11735
2021-03-19 16:22:10 -07:00
Matthew Ahrens
46df6e98aa
Remove unused rr_code
The `rr_code` field in `raidz_row_t` is unused.

This commit removes the field, as well as the code that's used to set
it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11736
2021-03-17 21:57:09 -07:00
Ryan Moeller
f845b2dd1c
FreeBSD: Clean up zfsdev_close to match Linux
Resolve some oddities in zfsdev_close() which could result in a
panic and were not present in the equivalent function for Linux.

- Remove unused definition ZFS_MIN_MINOR
- FreeBSD: Simplify zfsdev state destruction
- Assert zs_minor is valid in zfsdev_close
- Make locking around zfsdev state match Linux

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11720
2021-03-12 16:09:15 -08:00
Brian Behlendorf
e7a06356c1
Suppress cppcheck invalidSyntax warninigs
For some reason cppcheck 1.90 is generating an invalidSyntax warning
when the BF64_SET macro is used in the zstream source.  The same
warning is not reported by cppcheck 2.3, nor is their any evident
problem with the expanded macro.  This appears to be an issue with
this version of cppcheck.  This commit annotates the source to suppress
the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11700
2021-03-05 17:56:35 -08:00
Prakash Surya
f01eaed455
Add upper bound for slop space calculation
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #11023
2021-02-24 09:52:43 -08:00
Brian Atkinson
c0801bf35a
Cleaning up uio headers
Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11622
2021-02-20 20:16:50 -08:00
Christian Schwarz
edc508ac0b libzpool: set_global_var: refactor to not modify 'arg'
Also fixes leak of the dlopen handle in the error case.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11602
2021-02-19 22:45:04 -08:00
Ryan Moeller
64e0fe14ff
Restore FreeBSD resource usage accounting
Add zfs_racct_* interfaces for platform-dependent read/write accounting.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11613
2021-02-19 22:34:33 -08:00
Don Brady
03e02e5b56
Checksum errors may not be counted
Fix regression seen in issue #11545 where checksum errors 
where not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11609
2021-02-19 22:33:15 -08:00
Colm
658fb8020f
Add "compatibility" property for zpool feature sets
Property to allow sets of features to be specified; for compatibility
with specific versions / releases / external systems. Influences
the behavior of 'zpool upgrade' and 'zpool create'. Initial man
page changes and test cases included.

Brief synopsis:

zpool create -o compatibility=off|legacy|file[,file...] pool vdev...

compatibility = off : disable compatibility mode (enable all features)
compatibility = legacy : request that no features be enabled
compatibility = file[,file...] : read features from specified files.
Only features present in *all* files will be enabled on the
resulting pool. Filenames may be absolute, or relative to
/etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc
checked first).

Only affects zpool create, zpool upgrade and zpool status.

ABI changes in libzfs:

* New function "zpool_load_compat" to load and parse compat sets.
* Add "zpool_compat_status_t" typedef for compatibility parse status.
* Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum
* Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum

An initial set of base compatibility sets are included in
cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is
modified to install these in $pkgdatadir/compatibility.d and to
create symbolic links to a reasonable set of aliases.

Reviewed-by: ericloewe
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #11468
2021-02-17 21:30:45 -08:00
Ryan Moeller
436ab35a53
Make inline ABD predicates compatible with C++
FreeBSD's zfsd fails to build after e2af2acce3 due to strict type
checking errors from the implicit conversion between bool and boolean_t
in the inline predicate definitions in abd.h.

Use conditionals to return the correct value type from these functions.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11592
2021-02-15 10:15:50 -08:00
khng300
fc273894d2
Rename zfs_inode_update to zfs_znode_update_vfs
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes #11580
2021-02-09 11:17:29 -08:00
Matthew Ahrens
2d4bbd14fc
The abd child/parent relationship does not need to be tracked
ABD's currently track their parent/child relationship.  This applies to
`abd_get_offset()` and `abd_borrow_buf()`.  However, nothing depends on
knowing this relationship, it's only used for consistency checks to
verify that we are not destroying an ABD that's still in use.  When we
are creating/destroying ABD's frequently, the performance impact of
maintaining these data structures (in particular the atomic
increment/decrement operations) can be measurable.

This commit removes this verification code on production builds, but
keeps it when ZFS_DEBUG is set.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11535
2021-01-30 10:04:42 -08:00
Alan Somers
cf0977ad72 Parallelize vdev_validate
The runtime of vdev_validate is dominated by the disk accesses in
vdev_label_read_config.  Speed it up by validating all vdevs in
parallel using a taskq.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:36:51 -08:00
Alan Somers
a0e01997ec Parallelize vdev_load
metaslab_init is the slowest part of importing a mature pool, and it
must be repeated hundreds of times for each top-level vdev.  But its
speed is dominated by a few serialized disk accesses.  That can lead to
import times of > 1 hour for pools with many top-level vdevs on spinny
disks.

Speed up the import by using a taskqueue to parallelize vdev_load across
all top-level vdevs.

This also requires adding mutex protection to
metaslab_class_t.mc_historgram.  The mc_histogram fields were
unprotected when that code was first written in "Illumos 4976-4984 -
metaslab improvements" (OpenZFS
f3a7f6610f).  The lock wasn't added until
3dfb57a35e, though it's unclear exactly
which fields it's supposed to protect.  In any case, it wasn't until
vdev_load was parallelized that any code attempted concurrent access to
those fields.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:35:59 -08:00
Matthew Ahrens
aa755b3549
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:

1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.

2. When under moderate load, ZIL allocations are of 128KB.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.

On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-21 15:12:54 -08:00
Brian Atkinson
d0cd9a5cc6
Extending FreeBSD UIO Struct
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.

Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11438
2021-01-20 21:27:30 -08:00
Matthew Ahrens
e2af2acce3
allow callers to allocate and provide the abd_t struct
The `abd_get_offset_*()` routines create an abd_t that references
another abd_t, and doesn't allocate any pages/buffers of its own.  In
some workloads, these routines may be called frequently, to create many
abd_t's representing small pieces of a single large abd_t.  In
particular, the upcoming RAIDZ Expansion project makes heavy use of
these routines.

This commit adds the ability for the caller to allocate and provide the
abd_t struct to a variant of `abd_get_offset_*()`.  This eliminates the
cost of allocating the abd_t and performing the accounting associated
with it (`abdstat_struct_size`).  The RAIDZ/DRAID code uses this for
the `rc_abd`, which references the zio's abd.  The upcoming RAIDZ
Expansion project will leverage this infrastructure to increase
performance of reads post-expansion by around 50%.

Additionally, some of the interfaces around creating and destroying
abd_t's are cleaned up.  Most significantly, the distinction between
`abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
now disposed of with `abd_free()`.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #8853 
Closes #11439
2021-01-20 11:24:37 -08:00
Matthew Ahrens
2ac90457f5
record ioctl elapsed time in zpool history
Each zfs ioctl that changes on-disk state (e.g. set property, create
snapshot, destroy filesystem) is recorded in the zpool history, and is
printed by `zpool history -i`.

For performance diagnostic purposes, it would be useful to know how long
each of these ioctls took to run.  This commit adds that functionality,
with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist.

Additionally, the time recorded in this history log is currently the
time that the history record is written to disk.  But in many cases (CLI
args logging and ioctl logging), this happens asynchronously,
potentially many seconds after the operation completed.  This commit
changes the timestamp to reflect when the history event was created,
rather than when it was written to disk.

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11440
2021-01-11 09:29:25 -08:00
Toomas Soome
40ab927ae8
implicit conversion from 'boolean_t' to 'ds_hold_flags_t'
Build error on illumos with gcc 10 did reveal:

In function 'dmu_objset_refresh_ownership':
../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
[-Werror=enum-conversion]
      857 |  dsl_dataset_disown(ds, decrypt, tag);
          |                         ^~~~~~~
cc1: all warnings being treated as errors

libzfs_input_check.c: In function 'zfs_ioc_input_tests':
libzfs_input_check.c:754:28: error: implicit conversion from
'enum dmu_objset_type' to 'enum lzc_dataset_type'
[-Werror=enum-conversion]
  754 |  err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
      |                            ^~~~~~~~~~~
cc1: all warnings being treated as errors

The same issue is present in openzfs, and also the same issue about
ds_hold_flags_t, which currently defines exactly one valid value.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #11406
2020-12-27 16:31:02 -08:00
Matthew Ahrens
be5c6d9653
Only examine best metaslabs on each vdev
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab.  This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be very
impactful.  This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.

To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations.  These will now only examine the 100
best metaslabs (as determined by their ms_weight).  If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.

To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation.  Non-try_hard allocations will only examine the
100 best metaslabs of each vdev.  In summary, we will first try normal
allocation.  If that fails then we will do a gang allocation.  If that
fails then we will do a "try hard" gang allocation.  If that fails then
we will have a multi-layer gang block.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11327
2020-12-16 14:40:05 -08:00
Alexander Motin
f8020c9363
Make metaslab class rotor and aliquot per-allocator.
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks.  Once
multiple allocators were introduced to separate allocation of different
objects it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) same time,
competing for its resources.  This change makes each allocator choose
metaslab group independently, colliding with others only sporadically.

Test including simultaneous write into 4 files with recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores show
reduction of vdev queue lock contention from 54 to 27% due to better
load distribution.  Unfortunately it won't help much ZVOLs yet since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.

While there, to reduce the number of pointer dereferences change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11288
2020-12-15 10:55:44 -08:00
Ryan Libby
956f94010f
spa: avoid type narrowing warning
Building the spa module for i386 caused gcc to emit
-Wint-to-pointer-cast "cast to pointer from integer of different size"
because spa.spa_did was uint64_t but pthread_join (via thread_join in
spa_deactivate) takes a pointer (32-bit on i386).  Define spa_did to be
pointer-size instead.  For now spa_did is in fact never non-zero and the
thread_join could instead be ifdef'd out, but changing the size of
spa_did may be more useful for the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Closes #11336
2020-12-15 09:20:06 -08:00
Matthew Macy
923d730329
dmu_zfetch: fix memory leak
The last change caused the read completion callback to not be called
if the IO was still in progress. This change restores allocation
of the arc buf callback, but in the callback path checks the new
acb_nobuf field to know to skip buffer allocation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11324
2020-12-12 16:00:00 -08:00
Matthew Ahrens
ba67d82142
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes.  Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.

This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`.  The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data).  Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU.  On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

The code is also restructured in a few ways:

Added a `dr_dnode` field to the dbuf_dirty_record_t.  This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines.  The new field is needed by the lightweight-type dirty record.

To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t.  To do this we keep a hold on the dnode until it's zio's have
completed.  This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).

`dn_dirty_txg` was previously zeroed when the dnode was synced.  This
was not necessary, since its meaning can be "when was this dnode last
dirtied".  This change simplifies the new `dnode_rele_task()` code.

Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 10:26:02 -08:00
Paul Dagnelie
60a4c7d2a2
Implement memory and CPU hotplug
ZFS currently doesn't react to hotplugging cpu or memory into the 
system in any way. This patch changes that by adding logic to the ARC 
that allows the system to take advantage of new memory that is added 
for caching purposes. It also adds logic to the taskq infrastructure 
to support dynamically expanding the number of threads allocated to a 
taskq.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11212
2020-12-10 14:09:23 -08:00
Matthew Macy
1e4732cbda
Decouple arc_read_done callback from arc buf instantiation
Add ARC_FLAG_NO_BUF to indicate that a buffer need not be
instantiated.  This fixes a ~20% performance regression on
cached reads due to zfetch changes.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11220 
Closes #11232
2020-12-09 15:05:06 -08:00
Alexander Motin
6f5aac3ca0
Reduce latency effects of non-interactive I/O
Investigating influence of scrub (especially sequential) on random read
latency I've noticed that on some HDDs single 4KB read may take up to 4
seconds!  Deeper investigation shown that many HDDs heavily prioritize
sequential reads even when those are submitted with queue depth of 1.

This patch addresses the latency from two sides:
 - by using _min_active queue depths for non-interactive requests while
   the interactive request(s) are active and few requests after;
 - by throttling it further if no interactive requests has completed
   while configured amount of non-interactive did.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  It should reduce starvation if several non-interactive
processes are running same time with some interactive and I think should
make possible setting of zfs_vdev_max_active to as low as 1.

I've benchmarked this change with 4KB random reads from ZVOL with 16KB
block size on newly written non-fragmented pool.  On fragmented pool I
also saw improvements, but not so dramatic.  Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
, that means maximum latency reduction from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
, i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
, i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools.  On
idle fragmented pool I've measured scrub getting few percent faster
due to use of QD3 instead of QD2 before.  On idle non-fragmented pool
I've measured no difference.  On busy non-fragmented pool I've measured
scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11166
2020-11-24 09:26:42 -08:00
Matthew Macy
0ca45cb310
Fix problems in zvol_set_volmode_impl
- Don't leave fstrans set when passed a snapshot
- Don't remove minor if volmode already matches new value
- (FreeBSD) Wait for GEOM ops to complete before trying
  remove (at create time GEOM will be "tasting" in parallel)
- (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL
- (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11199
2020-11-17 09:50:52 -08:00
loli10K
4072f465bc
Fix 'zfs userspace' for received datasets in encrypted root
For encrypted receives, where user accounting is initially disabled on
creation, both 'zfs userspace' and 'zfs groupspace' fails with
EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to
set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a
successful dmu_objset_space_upgrade().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9501 
Closes #9596
2020-11-16 09:10:29 -08:00
Matthew Ahrens
d66aab7c08
Assertion failure when logging large output of channel program
The output of ZFS channel programs is logged on-disk in the zpool
history, and printed by `zpool history -i`.  Channel programs can use
10MB of memory by default, and up to 100MB by using the `zfs program -m`
flag.  Therefore their output can be up to some fraction of 100MB.

In addition to being somewhat wasteful of the limited space reserved for
the pool history (which for large pools is 1GB), in extreme cases this
can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in
`dmu_buf_hold_array_by_dnode()`.

This commit limits the output size that will be logged to 1MB.  Larger
outputs will not be logged, instead a entry will be logged indicating
the size of the omitted output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11194
2020-11-14 10:17:16 -08:00
Brian Behlendorf
b2255edcc0
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID.  This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.

A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`.  No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.

    zpool create <pool> draid[1,2,3] <vdevs...>

Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons.  The supported options include:

    zpool create <pool> \
        draid[<parity>][:<data>d][:<children>c][:<spares>s] \
        <vdevs...>

    - draid[parity]       - Parity level (default 1)
    - draid[:<data>d]     - Data devices per group (default 8)
    - draid[:<children>c] - Expected number of child vdevs
    - draid[:<spares>s]   - Distributed hot spares (default 0)

Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.

```
  pool: tank
 state: ONLINE
config:

    NAME                  STATE     READ WRITE CKSUM
    slag7                 ONLINE       0     0     0
      draid2:8d:68c:2s-0  ONLINE       0     0     0
        L0                ONLINE       0     0     0
        L1                ONLINE       0     0     0
        ...
        U25               ONLINE       0     0     0
        U26               ONLINE       0     0     0
        spare-53          ONLINE       0     0     0
          U27             ONLINE       0     0     0
          draid2-0-0      ONLINE       0     0     0
        U28               ONLINE       0     0     0
        U29               ONLINE       0     0     0
        ...
        U42               ONLINE       0     0     0
        U43               ONLINE       0     0     0
    special
      mirror-1            ONLINE       0     0     0
        L5                ONLINE       0     0     0
        U5                ONLINE       0     0     0
      mirror-2            ONLINE       0     0     0
        L6                ONLINE       0     0     0
        U6                ONLINE       0     0     0
    spares
      draid2-0-0          INUSE     currently in use
      draid2-0-1          AVAIL
```

When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command.  These options are leverages
by zloop.sh to test a wide range of dRAID configurations.

    -K draid|raidz|random - kind of RAID to test
    -D <value>            - dRAID data drives per group
    -S <value>            - dRAID distributed hot spares
    -R <value>            - RAID parity (raidz or dRAID)

The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.

Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-13 13:51:51 -08:00
Mateusz Guzik
1a0b4f566c
G/C struct znode -> z_moved
The field is yet another leftover from unsupported zfs_znode_move.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11186
2020-11-10 12:42:47 -08:00
Ryan Moeller
8a9634e2f3 Remove redundant oid parameter to update_pages
The oid comes from the znode we are already passing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:54:30 -08:00
Brian Behlendorf
8c7d604c62 Linux 5.10 compat: frame.h renamed objtool.h
In Linux 5.10 the linux/frame.h header was renamed linux/objtool.h.
Add a configure check to detect and use the correctly named header.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11085
2020-11-02 22:01:10 +00:00
Christian Schwarz
ab8c935ea6
zfs_vnops: make zfs_get_data OS-independent
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode 
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async 
itself. It's sibling zvol_get_data already is platform-independent.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #10979
2020-11-02 12:07:07 -08:00
Mateusz Guzik
09eb36ce3d
Introduce CPU_SEQID_UNSTABLE
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.

There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.

In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11142
2020-11-02 11:51:12 -08:00
Matthew Macy
8583540c6e
Consolidate zfs_holey and zfs_access
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11125
2020-10-31 09:40:08 -07:00
Mateusz Guzik
c4ede65bdf
zstd: track allocator statistics
Note that this only tracks sizes as requested by the caller.
Actual allocated space will almost always be bigger (e.g., rounded up to
the next power of 2 or page size). Additionally the allocated buffer may
be holding other areas hostage. Nonetheless, this is a starting point
for tracking memory usage in zstd.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11129
2020-10-30 15:26:10 -07:00
Matthew Macy
5fa356ea44
Remove UIO_ZEROCOPY functions structures
The original xuio zero copy functionality has always been unused 
on Linux and FreeBSD.  Remove this disabled code to avoid any
confusion and improve readability.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11124
2020-10-30 10:00:33 -07:00
Ryan Moeller
76d04993a6
Update references to nonexistent man pages in code
Refer to the correct section or alternative for FreeBSD and Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11132
2020-10-30 08:55:59 -07:00
Matthew Macy
e53d678d4a
Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD.  With a little refactoring they can be
moved to the common code which is what is done by this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11078
2020-10-21 14:08:06 -07:00
Ryan Moeller
485b50bb9e
Cross-platform acltype
The acltype property is currently hidden on FreeBSD and does not
reflect the NFSv4 style ZFS ACLs used on the platform.  This makes it
difficult to observe that a pool imported from FreeBSD on Linux has a
different type of ACL that is being ignored, and vice versa.

Add an nfsv4 acltype and expose the property on FreeBSD.

Make the default acltype nfsv4 on FreeBSD.

Setting acltype to an unhanded style is treated the same as setting
it to off.  The ACLs will not be removed, but they will be ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10520
2020-10-13 21:25:48 -07:00
Warner Losh
b302185a92
FreeBSD: make adjustments for the standalone environment
In FreeBSD, there are three compile environments that are supported:
user land, the kernel and the bootloader / standalone. Adjust the
headers to compile in the standalone environment. Limit kernel-only
items from view when _STANDALONE is defined.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Warner Losh <imp@FreeBSD.org>
Closes #10998
2020-10-13 21:05:49 -07:00
Christian Schwarz
701f656b97
dmu.h: remove stale declaration dmu_objset_snapshot_tmp
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11047
2020-10-13 16:46:00 -07:00
Christian Schwarz
61868bb14d
zil_parse: make callback parameters const
Code cleanup, a follow up commit to 4d55ea81.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11020
2020-10-09 09:34:54 -07:00
Brian Behlendorf
d0249a4bd0
Replace ZFS on Linux references with OpenZFS
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux.  Web links have been updated
to refer to https://github.com/openzfs/zfs.  The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11007
2020-10-08 20:10:13 -07:00
Ryan Moeller
4d55ea811d
Throw const on some strings
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.

Add const in a few places where it makes sense.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10997
2020-10-02 17:44:10 -07:00
Sebastian Gottschall
8a171ccd92
do a cyclic seek for unused memory objects in pool
In non regular use cases allocated memory might stay persistent in memory
pool. This small patch checks every minute if there are old objects which
can be released from memory pool.

Right now with regular use, the pool is checked for old objects on each
allocation attempt from this pool. so basically polling by its use. Now
consider what happens if someone writes a lot of files and stops use of
the volume or even unmounts it. So the code will no longer check if
objects can be released from the pool. Already allocated objects will
still stay in pool cache. this is no big issue for common use. But
someone discovered this issue while doing tests. personally i know this
behavior and I'm aware of it. Its no big issue. just a enhancement

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10938 
Closes #10969
2020-09-30 13:22:34 -07:00
Brian Behlendorf
96951e0327
Fix objtool configure check
The m4 objtool configure check can incorrectly fail because of a
missing header in the test.  This appears to be the result of a
recent kernel change and was observed on the Fedora 5.8.11-200
kernel.

  In file included from /home/fedora/zfs/build/objtool/objtool.c:75:
  ./arch/x86/include/asm/frame.h💯57: error: 'struct pt_regs'
      declared inside parameter list will not be visible outside
      of this definition or declaration [-Werror]

The consequence of this is that the "stack_frame_non_standard"
check is never run and HAVE_STACK_FRAME_NON_STANDARD is set
incorrectly which results in a build failure.  This change adds
the appropriate header to the "objtool" check so it now behaves
as intended.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10990
2020-09-28 16:40:50 -07:00
Matthew Macy
af20b97078
zfetch: Don't issue new streams when old have not completed
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a readonly workload
(and thus no exponential backoff in operations from the "write
throttle") such as L2ARC rebuild it is possible to saturate the drives
with I/O requests. These are then effectively compounded with prefetch
requests.

This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10900
2020-09-27 17:08:38 -07:00
Matthew Macy
7b8363d7f0
FreeBSD: Add support for procfs_list
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
                           
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10890
2020-09-23 16:43:51 -07:00
Ryan Moeller
5f8a9e6a02
FreeBSD: Reduce stack usage of Lua
Use the same reduced buffer size for lauxlib that is used on Linux.

Fixes panic on HEAD in lua gsub test designed to exhaust stack space.

With this we can remove the special case to reserve more stack space
on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10959
2020-09-22 16:03:11 -07:00
George Wilson
c494aa7f57
vdev_ashift should only be set once
== Motivation and Context

The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.

In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.

== Description

This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes #10932
2020-09-18 12:13:47 -07:00
Ryan Moeller
7ead2be3d2
Rename acltype=posixacl to acltype=posix
Prefer acltype=off|posix, retaining the old names as aliases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10918
2020-09-16 12:26:06 -07:00
Toomas Soome
1db9e6e4e4
zfs label bootenv should store data as nvlist
nvlist does allow us to support different data types and systems.

To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.

Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10774
2020-09-15 15:42:27 -07:00
George Amanakis
085321621e
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.

This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.

L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.

When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.

Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.

Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
  an ARC buffer. However, prefetches may be issued in a way that
  arc_read_done() is bypassed. Instead, move the related code in
  l2arc_write_eligible() to account for those cases too.

Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 10:10:44 -07:00
Ryan Moeller
ebc4b52369
Avoid possibility of division by zero
When hz > 1000, msec / (1000 / hz) results in division by zero.

I found somewhere in FreeBSD using howmany(msec * hz, 1000) to convert
ms to ticks, avoiding the potential for a zero in the divisor.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10894
2020-09-08 11:39:16 -07:00
Don Brady
4f07282786
Avoid posting duplicate zpool events
Duplicate io and checksum ereport events can misrepresent that 
things are worse than they seem. Ideally the zpool events and the 
corresponding vdev stat error counts in a zpool status should be 
for unique errors -- not the same error being counted over and over. 
This can be demonstrated in a simple example. With a single bad 
block in a datafile and just 5 reads of the file we end up with a 
degraded vdev, even though there is only one unique error in the pool.

The proposed solution to the above issue, is to eliminate duplicates 
when posting events and when updating vdev error stats. We now save 
recent error events of interest when posting events so that we can 
easily check for duplicates when posting an error. 

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #10861
2020-09-04 10:34:28 -07:00
Matthew Ahrens
3808032489
nowait synctask must succeed
If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with
`dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC.
However, there is no way to notice or communicate this failure, so it's
extremely difficult to use this functionality correctly, and in fact
almost all callers use `ZFS_SPACE_CHECK_NONE`.

This commit removes the `zfs_space_check_t` argument from
`dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10855
2020-09-04 10:29:39 -07:00
Matthew Macy
ac6e5fb202
Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate
There are a number of places where cv_?_sig is used simply for
accounting purposes but the surrounding code has no ability to
cope with actually receiving a signal. On FreeBSD it is possible
to send signals to individual kernel threads so this could
enable undesirable behavior.

This patch adds routines on Linux that will do the same idle
accounting as _sig without making the task interruptible. On
FreeBSD cv_*_idle  are all aliases for cv_*

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10843
2020-09-03 20:04:09 -07:00
Ryan Moeller
7b4e27232d
Add 'zfs rename -u' to rename without remounting
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.

This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.

Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10839
2020-09-01 16:14:16 -07:00
Ryan Moeller
a2f944a140
zpool: Change base URL for ZFS messages to openzfs-docs
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10820
2020-08-26 21:43:06 -07:00
Sebastian Gottschall
184df27eef
Avoid symbol collision with in-kernel zstdlib
For Linux, when zfs is compiled as an in kernel static variant
and the in kernel zstd library is compiled statically into the kernel
a symbol collision will occur.  This wrapper header renames all
of the relevant zstd functions to avoid this problem.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10775
2020-08-24 12:20:41 -07:00
Chris McDonough
c3b03d0701
Appease GCC sprintf warnings found on Fedora 32/GCC 10.0.1
Increase the size of DDT_NAMELEN and MNT_LINE_MAX to appease GCC
snprintf truncation warnings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris McDonough <chrism@plope.com>
Closes #10712
Closes #10766
2020-08-24 10:32:59 -07:00
Ryan Moeller
6fe3498ca3
Import vdev ashift optimization from FreeBSD
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

 1. Existing pools created with devices that have different logical
    and physical block sizes, but were configured to use the logical
    block size (e.g. because the OS version used for pool construction
    reported the logical block size instead of the physical block
    size) will suddenly find that the vdev allocation size has
    increased. This can be easily tolerated for active members of
    the array, but ZFS would prevent replacement of a vdev with
    another identical device because it now appears that the smaller
    allocation size required by the pool is not supported by the new
    device.

 2. The device's physical block size may be too large to be supported
    by ZFS. The optimal allocation size for the vdev may be quite
    large. For example, a RAID controller may export a vdev that
    requires read-modify-write cycles unless accessed using 64k
    aligned/sized requests. ZFS currently has an 8k minimum block
    size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Macy <mmacy@freebsd.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10619
2020-08-21 12:53:17 -07:00
Matthew Macy
1c2725a157
FreeBSD: 11.x arc_stats compatibility
Removing other_size from arc_stats breaks top in 11.x jails
running on HEAD.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10745
2020-08-20 10:55:02 -07:00
Michael Niewöhner
10b3c7f5e4 Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:

- zstd: A basic ZStandard compression algorithm Available compression.
  Levels for zstd are zstd-1 through zstd-19, where the compression
  increases with every level, but speed decreases.

- zstd-fast: A faster version of the ZStandard compression algorithm
  zstd-fast is basically a "negative" level of zstd. The compression
  decreases with every level, but speed increases.

  Available compression levels for zstd-fast:
   - zstd-fast-1 through zstd-fast-10
   - zstd-fast-20 through zstd-fast-100 (in increments of 10)
   - zstd-fast-500 and zstd-fast-1000

For more information check the man page.

Implementation details:

Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.

The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers).  The upper bits are used to store the compression level.

It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.

All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables.  Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).

The userspace tools all use the combined/bit-shifted value.

Additional notes:

zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.

ZSTD is included with all current tests and new tests are added
as-needed.

Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born.  This is currently only used by zstd but can be
extended as needed.

Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-20 10:30:06 -07:00
Matthew Macy
5e7eaf8fbd
Add define to enable autotrim to default to on
In FreeBSD trim has defaulted to on for several
years. In order to minimize POLA violations on
import it's important to maintain this default
when importing vendored openzfs in to FreeBSD
base.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10719
2020-08-18 09:52:30 -07:00
Ryan Moeller
009cc8e884
Make zc_nvlist_src_size limit tunable
We limit the size of nvlists passed to the kernel so a user cannot make
the kernel do an unreasonably large allocation.  On FreeBSD this limit
was 128 kiB, which turns out to be a bit too small when doing some
operations involving a large number of datasets or snapshots, for
example replication.

Make this limit tunable, with a platform-specific auto default.
Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the
system limit on user wired memory, which allows it to scale depending
on system configuration.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Issue #6572 
Closes #10706
2020-08-18 09:33:55 -07:00
Matthew Ahrens
85ec5cbae2
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K.  Therefore, the space used to cache each block
is rounded up to a multiple of 4K.  The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat.  However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`.  This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is.  It also impacts how the ARC responds to memory
pressure.  As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste.  If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large.  On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.

This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`.  As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-17 20:04:04 -07:00
Matthew Ahrens
994de7e4b7
Remove KMC_KMEM and KMC_VMEM
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.

KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM).  The flag, module parameter, /proc entries, and associated
code are removed.

KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc().  The flag, /proc entries, and associated code are removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10673
2020-08-17 16:04:28 -07:00
Matthew Macy
5f1984f2f8
FreeBSD: fallback to /boot/ to look for zpool.cache
Up until now zpool.cache has always lived in /boot on FreeBSD.
For the sake of compatibility fallback to /boot if zpool.cache
isn't found in /etc/zfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10720
2020-08-17 14:43:47 -07:00
Ryan Moeller
ed726fb063
Move ZVOL_DIR back to zfs.h
This was previously moved because nothing else in-tree uses it, but
evidently DilOS uses it out of tree.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@freebsd.org>
Closes #10361 
Closes #10685
2020-08-11 13:12:12 -07:00
Allan Jude
8fb79fdddb
Change the error handling for invalid property values
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist

With a compression= property outside of the understood range:

Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.

After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10631
2020-08-01 08:41:31 -07:00
Matthew Ahrens
3442c2a02d
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages.  This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory.  The ARC shrinker callback is invoked from the
page-allocation code path.

2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released.  However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`).  Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`.  This has several
negative impacts:

1. ZFS doesn't use RAM to cache data effectively.

2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.

3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker.  This limits the amount of time it takes
to allocate a single page.

2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system.  Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).

3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing.  This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.

4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.

5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-07-31 21:10:52 -07:00
Matthew Macy
27d96d2254
Rename refcount.h to zfs_refcount.h
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10620
2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos
843e9ca2e1
Introduce names for ZTHRs
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10630
2020-07-29 09:43:33 -07:00
Matthew Macy
5678d3f593
Prefix zfs internal endian checks with _ZFS
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10621
2020-07-28 13:02:49 -07:00
Matthew Macy
e64cc4954c
Refactor ccompile.h to not include system headers
This is a step toward being able to vendor the OpenZFS code in FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10625
2020-07-25 20:09:50 -07:00
Kyle Evans
bfafe1780a
Annotate unused parameters on inline definitions as such
* libspl: umem: These are obviously and intentionally unused; annotate 
  them as such to appease -Wunused-parameter builds that include this 
  header.

* sys/dmu.h: In this case, clear_on_evict_dbufp is only used for 
  ZFS_DEBUG builds, so annotate it as __maybe_unused to appease 
  -Wunused-parameter.


Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #10606
2020-07-23 17:41:48 -07:00
Ryan Moeller
f7a68f99d0
FreeBSD: Remove some code duplication in sysctl_os.c
Drop unnecessary redefinition's of several arcstat values.
Put missing extern declaration of arc_no_grow_shift in arc_impl.h.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10609
2020-07-23 17:35:34 -07:00
Matthew Ahrens
5dd92909c6
Adjust ARC terminology
The process of evicting data from the ARC is referred to as
`arc_adjust`.

This commit changes the term to `arc_evict`, which is more specific.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10592
2020-07-22 09:51:47 -07:00
Ryan Moeller
0421f257b2
FreeBSD: Add legacy arc_min and arc_max
These tunables were renamed from vfs.zfs.arc_min and 
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10579
2020-07-19 10:15:34 -07:00
Matthew Ahrens
6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
George Wilson
c15d36c674
Remove dependency on sharetab file and refactor sharing logic
== Motivation and Context

The current implementation of 'sharenfs' and 'sharesmb' relies on
the use of the sharetab file. The use of this file is os-specific
and not required by linux or freebsd. Currently the code must
maintain updates to this file which adds complexity and presents
a significant performance impact when sharing many datasets. In
addition, concurrently running 'zfs sharenfs' command results in
missing entries in the sharetab file leading to unexpected failures.

== Description

This change removes the sharetab logic from the linux and freebsd
implementation of 'sharenfs' and 'sharesmb'. It still preserves an
os-specific library which contains the logic required for sharing
NFS or SMB. The following entry points exist in the vastly simplified
libshare library:

- sa_enable_share -- shares a dataset but may not commit the change
- sa_disable_share -- unshares a dataset but may not commit the change
- sa_is_shared -- determine if a dataset is shared
- sa_commit_share -- notify NFS/SMB subsystem to commit the shares
- sa_validate_shareopts -- determine if sharing options are valid

The sa_commit_share entry point is provided as a performance enhancement
and is not required. The sa_enable_share/sa_disable_share may commit
the share as part of the implementation. Libshare provides a framework
for both NFS and SMB but some operating systems may not fully support
these protocols or all features of the protocol.

NFS Operation:
For linux, libshare updates /etc/exports.d/zfs.exports to add
and remove shares and then commits the changes by invoking
'exportfs -r'. This file, is automatically read by the kernel NFS
implementation which makes for better integration with the NFS systemd
service. For FreeBSD, libshare updates /etc/zfs/exports to add and
remove shares and then commits the changes by sending a SIGHUP to
mountd.

SMB Operation:
For linux, libshare adds and removes files in /var/lib/samba/usershares
by calling the 'net' command directly. There is no need to commit the
changes. FreeBSD does not support SMB.

== Performance Results

To test sharing performance we created a pool with an increasing number
of datasets and invoked various zfs actions that would enable and
disable sharing. The performance testing was limited to NFS sharing.
The following tests were performed on an 8 vCPU system with 128GB and
a pool comprised of 4 50GB SSDs:

Scale testing:
- Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
- Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &

Functional testing:
- share each filesystem serially -- zfs share -a
- unshare each filesystem serially -- zfs unshare -a
- reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>

For 'zfs sharenfs=on' scale testing we saw an average reduction in time
of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time
of 83.36%.

Functional testing also shows a huge improvement:
- zfs share -- 97.97% reduction in time
- zfs unshare -- 96.47% reduction in time
- zfs inhert -r sharenfs -- 99.01% reduction in time

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Bryant G. Ly <bryangly@gmail.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-68690
Closes #1603
Closes #7692
Closes #7943
Closes #10300
2020-07-13 09:19:18 -07:00
Matthew Ahrens
e59a377a8f
filesystem_limit/snapshot_limit is incorrectly enforced against root
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit."  Two types of users are allowed to change
the limit:

1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`.  This works properly.

2. A user with elevated system privileges (e.g. root).  This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.

The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process.  This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.

This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged.  Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.

This problem only impacts Linux.  Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8226
Closes #10545
2020-07-11 17:18:02 -07:00
Mark Johnston
6e00561712 Add a "try" operation for range locks
zfs_rangelock_tryenter() bails immediately instead of waiting for the
lock to become available.  This will be used to resolve a deadlock in
the FreeBSD page-in code.  No functional change intended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #10519
2020-07-06 11:53:31 -07:00
Brian Behlendorf
9a49d3f3d3
Add device rebuild feature
The device_rebuild feature enables sequential reconstruction when
resilvering.  Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices.  However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.

The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.

    zpool attach -s <pool> <existing vdev> <new vdev>
    zpool replace -s <pool> <old vdev> <new vdev>

The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.

The `zpool wait -t resilver` command was extended to wait on
sequential resilvers.  From this perspective they are no different
than healing resilvers.

Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.

As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory.  Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.

Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10349
2020-07-03 11:05:50 -07:00
Matthew Ahrens
3c42c9ed84
Clean up OS-specific ARC and kmem code
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems.  In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems.  For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.

Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.

These changes do not impact system behavior, they are purely code
cleanup.  The changes are:

Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.

In `arc_available_memory()`:
 * `desfree` is unused, remove it
 * rename `freemem` to avoid conflict with pre-existing `#define`
 * remove checks related to arenas
 * use units of bytes, rather than converting from bytes to pages and
   then back to bytes

`SPL_KMEM_CACHE_REAP` is unused, remove it.

`skc_reap` is unused, remove it.

The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.

`vmem_size()` and associated type and macros are unused, remove them.

In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10499
2020-06-29 09:01:07 -07:00
Arvind Sankar
4d8e68c42f Avoid installing kernel headers on FreeBSD
The kernel headers are installed for DKMS on linux, so don't install
them unless we're building on linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10506
2020-06-27 17:40:14 -07:00
Arvind Sankar
6b99fc0620 Fixes for make dist
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.

Add some files that were missing altogether in various Makefile's.

The changes to disted files in this commit (excluding deleted files):

+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10501
2020-06-26 14:20:02 -07:00
Matthew Ahrens
67c0f0dedc
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.

Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 10:42:27 -07:00
Arvind Sankar
7513807320 Drop unnecessary srcdir paths
There's no need to specify the srcdir explicitly in _HEADERS and
EXTRA_DIST.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10493
2020-06-24 18:20:18 -07:00
Ryan Moeller
9192f27c1d
Add zfs_multihost_interval tunable handler for FreeBSD
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.

Add the handler so the tunable can be declared in common code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10490
2020-06-23 13:32:42 -07:00
Arvind Sankar
0ce2de637b Add prototypes
Add prototypes/move prototypes to header files.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:32 -07:00
Arvind Sankar
c3fe42aabd Remove dead code
Delete unused functions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:18 -07:00
Matthew Macy
8056a75672
Disambiguate condvar API contract
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.

Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10471
2020-06-18 10:17:50 -07:00
Matthew Macy
7564073ed6
Add abd_cache_reap_now for abd_chunk_cache users
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10474
2020-06-17 21:44:13 -07:00
Jorgen Lundman
cd07d7c83f
drr_begin: can't forward declare untagged struct
When compiling with Clang++ it does not allow for untagged structs, so
struct ddr_begin needs to be declared before the struct that uses it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10453
2020-06-16 11:57:04 -07:00
Jorgen Lundman
d366c8fd7a
Make struct vdev_disk_t be platform private
Linux defines different vdev_disk_t members to macOS, but they are
only used in vdev_disk.c so move the declaration there.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10452
2020-06-16 11:43:33 -07:00
George Amanakis
f2edc0078f
Fix gcc10.1 truncation error
gcc10.1 complains with:

../../include/sys/dmu.h:373:24: error: ‘%s’ directive output may be
truncated writing up to 95 bytes into a region of size 75
[-Werror=format-truncation=]
  373 | #define DMU_POOL_DDT   "DDT-%s-%s-%s"
      |                        ^~~~~~~~~~~~~~
../../module/zfs/ddt.c:256:37: note: in expansion of macro
‘DMU_POOL_DDT’
  256 |  (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
      |                                     ^~~~~~~~~~~~
../../include/sys/dmu.h:373:32: note: format string is defined here
  373 | #define DMU_POOL_DDT   "DDT-%s-%s-%s"
      |                                ^~
../../module/zfs/ddt.c:256:9: note: ‘snprintf’ output 7 or more bytes
(assuming 102) into a destination of size 80
  256 |  (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  257 |      zio_checksum_table[ddt->ddt_checksum].ci_name,
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  258 |      ddt_ops[type]->ddt_op_name, ddt_class_name[class]);
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Increasing DTT_NAMELEN fixes it.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10433
2020-06-13 11:02:00 -07:00
Brian Atkinson
e08b993396
Removing ZERO_PAGE abd_alloc_zero_scatter
For MIPS architectures on Linux the ZERO_PAGE macro references
empty_zero_page, which is exported as a GPL symbol. The call to
ZERO_PAGE in abd_alloc_zero_scatter has been removed and a single
zero'd page is now allocated for each of the pages in abd_zero_scatter
in the kernel ABD code path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10428
2020-06-10 17:54:11 -07:00
Andrea Gelmini
dd4bc569b9
Fix typos
Correct various typos in the comments and tests.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #10423
2020-06-09 21:24:09 -07:00
Matthew Ahrens
7bcb7f0840
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:

By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks.  By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature.  A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out.  The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics
of zfs send/receive:

1. "-L to no-L" incrementals are rejected.  If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:

    incremental send stream requires -L (--large-block), to match
    previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals.  This flag is currently not set on any send streams.  In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.

In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks.  The zio pipeline will recompress
each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224 
Closes #10383
2020-06-09 10:41:01 -07:00
George Amanakis
b7654bd794
Trim L2ARC
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789 
Closes #10224
2020-06-09 10:15:08 -07:00
Pawel Jakub Dawidek
529246df96
Restore support for in-kernel ZFS ioctls
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.

Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #10417
2020-06-08 13:57:22 -07:00
Jorgen Lundman
c9e319faae
Replace sprintf()->snprintf() and strcpy()->strlcpy()
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used.  If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().

The biggest change is adding a size parameter to zfs_id_to_fuidstr().

The various *_impl_get() functions are only used on linux and have
not yet been updated.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10400
2020-06-07 11:42:12 -07:00
Ryan Moeller
60265072e0
Improve compatibility with C++ consumers
C++ is a little picky about not using keywords for names, or string
constness.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10409
2020-06-06 12:54:04 -07:00
Allan Jude
4547fc4e07
Connect dataset_kstats for FreeBSD
Expand the FreeBSD spl for kstats to support all current types

Move the dataset_kstats_t back to zvol_state_t from zfs_state_os_t
now that it is common once again

```
kstat.zfs/mypool.dataset.objset-0x10b.nunlinked: 0
kstat.zfs/mypool.dataset.objset-0x10b.nunlinks: 0
kstat.zfs/mypool.dataset.objset-0x10b.nread: 150528
kstat.zfs/mypool.dataset.objset-0x10b.reads: 48
kstat.zfs/mypool.dataset.objset-0x10b.nwritten: 134217728
kstat.zfs/mypool.dataset.objset-0x10b.writes: 1024
kstat.zfs/mypool.dataset.objset-0x10b.dataset_name: mypool/datasetname
```

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10386
2020-06-05 17:17:02 -07:00
Jorgen Lundman
081de9a86d
Restore avl_update() calls and related functions
The macOS kmem implementation uses avl_update() and related
functions.  These same function exist in the Solaris AVL code but
were removed because they were unused.  Restore them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10390
2020-06-03 09:49:32 -07:00
Brian Atkinson
fb822260b1
Gang ABD Type
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.

This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10069
2020-05-20 18:06:09 -07:00
Ryan Moeller
d2782af461
Fix ZVOL_DIR
We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct.

Move the definition to the file where it is needed, and define it as
/dev/zvol/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10337
2020-05-16 10:10:38 -07:00
Brian Behlendorf
2ade659eb4
Fix abd_enter/exit_critical wrappers
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections.  On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving.  By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.

Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10332
2020-05-14 20:45:16 -07:00
Matthew Ahrens
8b240f14f9
remove unneeded member drc_err of dmu_recv_cookie_t
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10319
2020-05-14 12:10:29 -07:00
Brian Atkinson
fc551d7efb
Combine OS-independent ABD Code into Common Source File
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.

The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10293
2020-05-10 12:23:52 -07:00
George Amanakis
657fd33bcf
Improvements on persistent L2ARC
Functional changes:

We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.

1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
   reflect the total aligned size of log blocks on the device. This is
   also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
   which reflect the total number of L2ARC log blocks present on cache
   devices.  It is also reflected in the header of the cache device as
   "dh_lb_count".

In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.

Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.

If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().

If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.

In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.

Non-functional changes:

Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.

Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.

Fix inaccurate comments for example in l2arc_log_blk_restore().

Add asserts at the end in l2arc_evict() and l2arc_write_buffers().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10228
2020-05-07 16:34:03 -07:00
Paul Dagnelie
108a454a46
Add support for boot environment data to be stored in the label
Modern bootloaders leverage data stored in the root filesystem to 
enable some of their powerful features. GRUB specifically has a grubenv 
file which can store large amounts of configuration data that can be 
read and written at boot time and during normal operation. This allows 
sysadmins to configure useful features like automated failover after 
failed boot attempts. Unfortunately, due to the Copy-on-Write nature 
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store 
data that allows the bootloader to make changes to the data.

This work is very similar to work that was done on Illumos to enable 
similar functionality in the FreeBSD bootloader. This patch is different 
in that the data being stored is a raw grubenv file; this file can store 
arbitrary variables and values, and the scripting provided by grub is 
powerful enough that special structures are not required to implement 
advanced behavior.

We repurpose the second padding area in each label to store the grubenv 
file, protected by an embedded checksum. We add two ioctls to get and 
set this data, and libzfs_core and libzfs functions to access them more 
easily. There are no direct command line interfaces to these functions; 
these will be added directly to the bootloader utilities.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10009
2020-05-07 09:36:33 -07:00
Matthew Ahrens
196bee4cfd
Remove deduplicated send/receive code
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
2020-04-23 10:06:57 -07:00
Matthew Ahrens
32d805c3e2
Use a struct to organize metaslab-group-allocator fields
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators".  Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators.  In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.

This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t.  The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10213
2020-04-22 10:26:56 -07:00
George Amanakis
77f6826b83
Persistent L2ARC
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #925 
Closes #1823 
Closes #2672 
Closes #3744 
Closes #9582
2020-04-10 10:33:35 -07:00
Ryan Moeller
36a6e2335c
Don't ignore zfs_arc_max below allmem/32
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.

Add warning messages when tunables are being ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10157
Closes #10158
2020-04-09 15:39:48 -07:00
Matthew Macy
8b27e08ed8
Add separate field for indicating that spa is in middle of split
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10178
2020-04-09 09:59:31 -07:00
Ryan Moeller
7e3df9db12
Finish refactoring for ZFS_MODULE_PARAM_CALL
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.

To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.

With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10179
2020-04-07 10:06:22 -07:00
Paul Dagnelie
5a42ef04fd
Add 'zfs wait' command
Add a mechanism to wait for delete queue to drain.

When doing redacted send/recv, many workflows involve deleting files 
that contain sensitive data. Because of the way zfs handles file 
deletions, snapshots taken quickly after a rm operation can sometimes 
still contain the file in question, especially if the file is very 
large. This can result in issues for redacted send/recv users who 
expect the deleted files to be redacted in the send streams, and not 
appear in their clones.

This change duplicates much of the zpool wait related logic into a 
zfs wait command, which can be used to wait until the internal
deleteq has been drained.  Additional wait activities may be added 
in the future. 

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9707
2020-04-01 10:02:06 -07:00
Ryan Moeller
9a51738b60
Let default arc_c_max be platform dependent
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator.  Other platforms
do not require the same logic.

Implement an arc_default_max() function to determine a default max ARC
size in platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10155
2020-03-27 09:14:46 -07:00
Matthew Ahrens
3f38797338
Compile cityhash code into libzfs
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10152
2020-03-27 09:11:22 -07:00
Paul Dagnelie
7145123b0a
Separate warning for incomplete and corrupt streams
This change adds a separate return code to zfs_ioc_recv that is used 
for incomplete streams, in addition to the existing return code for 
streams that contain corruption.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10122
2020-03-17 10:30:33 -07:00
Matthew Ahrens
0fdd6106bb
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock.  For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it.  This may lead to deadlock, since the lock order is
reversed.

Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock.  However, there are a few places in
the send and receive code paths that do not.  For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.

This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock.  In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t).  Thus we do not need to acquire the dp_config_rwlock in
many new places.

I also cleaned up code in dmu_redact_snap() and send_traverse_thread().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 10:55:02 -07:00
John Poduska
e6b28efccc
Prevent race condition in dnode_dest (#10101)
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:

	refs = zfs_refcount_remove(&dn->dn_holds, tag);
	mutex_exit(&dn->dn_mtx);

So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.

This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #7814
Closes #10101
2020-03-12 10:25:56 -07:00
Matthew Ahrens
1dc32a67e9
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads.  This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks.  Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.

The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread.  It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`.  This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU.  It also induces later
costs by other threads:

 * Since the data was only prefetched, dmu_send()->dmu_dump_write() will
   need to call arc_read() again to get the data.  This will have to
   look up the arc_hdr in the hash table and copy the data from the
   scatter ABD in the arc_hdr to a linear ABD in arc_buf.  This takes
   27% of one CPU.

 * dmu_dump_write() needs to arc_buf_destroy()  This takes 11% of one
   CPU.

 * arc_adjust() will need to evict this arc_hdr, taking about 50% of one
   CPU.

All of these costs can be avoided by bypassing the ARC if the data is
not already cached.  This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.

The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second.  This change
increases the metric by 50%, from ~100,000 to ~150,000.  When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).

In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache.  In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 10:51:04 -07:00
Brian Behlendorf
2288d41968
Add trim support to zpool wait
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #10071
2020-03-04 15:07:11 -08:00
Matthew Ahrens
b3212d2fa6
Improve performance of zio_taskq_member
__zio_execute() calls zio_taskq_member() to determine if we are running
in a zio interrupt taskq, in which case we may need to switch to
processing this zio in a zio issue taskq.  The call to
zio_taskq_member() can become a performance bottleneck when we are
processing a high rate of zio's.

zio_taskq_member() calls taskq_member() on each of the zio interrupt
taskqs, of which there are 21.  This is slow because each call to
taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
slow.

This commit improves the performance of zio_taskq_member() by having it
cache the value of tsd_get(taskq_tsd), reducing the number of those
calls to 1/21th of the current behavior.

In a test case running `zfs send -c >/dev/null` of a filesystem with
small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of
one CPU, and with this change it is reduced to 1.3%.  Overall time to
perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000
blocks/sec).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10070
2020-03-03 10:29:38 -08:00
Ryan Moeller
9bb907bc3f
Make spa_history_zone platform-dependent in kernel
This function should only return "linux" on Linux.

Move the kernel part of the function out of common code.
Fix the tests for FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10079
2020-03-02 09:43:30 -08:00
Matthew Macy
ae9f92f6f3
Re-share zfsdev_getminor and zfs_onexit_fd_hold
By adding a zfs_file_private accessor to the common
interfaces and some extensions to FreeBSD platform
code it is now possible to share the implementations
for the aforementioned functions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10073
2020-02-28 14:50:32 -08:00
Brian Behlendorf
2c3a83701d Linux 5.6 compat: time_t
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired.  Callers are now required to use the time64_t type.

Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code.  They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.

Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.

Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10052
Closes #10064
2020-02-27 09:31:02 -08:00
Matthew Macy
28caa74b19
Refactor dnode dirty context from dbuf_dirty
* Add dedicated donde_set_dirtyctx routine.
* Add empty dirty record on destroy assertion.
* Make much more extensive use of the SET_ERROR macro.

Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9924
2020-02-26 16:09:17 -08:00
Dirkjan Bussink
327000ce04
Remove zfs_getattr and convoff dead code
The `convoff` function is called only in one code path in `zfs_space`.
Each caller of `zfs_space` is called with a `flock64_t` that has
`l_whence` set to `SEEK_SET`. This means that `convoff` always results
in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and
`int whence` is `SEEK_SET` as well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes #10006
2020-02-24 15:38:22 -08:00
Richard Laager
f244846462
Prefer org.openzfs for features and properties
Moving forward, we wish to use org.openzfs (no dash) rather than
org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
The existing feature GUIDs cannot be changed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #10003
2020-02-18 09:36:50 -08:00
Jason King
13b5a4d5c0
Support setting user properties in a channel program
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9950
2020-02-14 13:41:42 -08:00
Matthew Ahrens
4fe3a842bb
Remove limit on number of async zio_frees of non-dedup blocks
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG.  This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.

zfs_async_block_max_blocks is set to 100,000 by default.  For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy").  For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.

This change increases zfs_async_block_max_blocks to be unlimited by
default.  To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg.  The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10000
2020-02-14 08:39:46 -08:00
Alexander Motin
465e4e795e
Remove duplicate dbufs accounting
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.

According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9949
2020-02-13 11:20:42 -08:00
Christian Schwarz
948f0c4419 zcp: add zfs.sync.bookmark
Add support for bookmark creation and cloning.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:17 -08:00
Christian Schwarz
a73f361fdb Implement bookmark copying
This feature allows copying existing bookmarks using

    zfs bookmark fs#target fs#newbookmark

There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.

Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).

ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.

Overview:

- Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
  - create new bookmark node
  - copy source's `zbn_phys` to new's `zbn_phys`
  - zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
  bookmarks as sources
  - => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
  and bookmark sources
- Adjust CLI
  - refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
  - warn about redaction bookmark handling
- Add test cases
  - CLI
  - pyyzfs libzfs_core bindings

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:12 -08:00
Paul Zuchowski
bc67cba7c0
Fix zdb -R with 'b' flag
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display.  Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9640
Closes #9729
2020-02-10 14:00:05 -08:00
Ryan Moeller
57940b435c
Share some code for spa deadman tunables
We need to do the same thing to update all spas on any OS for these
tunables, so let's share the code.

While here let's match the types of the literals initializing the
variables with the type of the variable.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9964
2020-02-10 13:11:30 -08:00
Attila Fülöp
31b160f0a6
ICP: Improve AES-GCM performance
Currently SIMD accelerated AES-GCM performance is limited by two
factors:

a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.

b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.

To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.

Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9749
2020-02-10 12:59:50 -08:00
Alexander Motin
e0ce98d57c
Reduce number of atomic_add() calls in aggsum
Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to
re-borrow after the flush.  But since asc_borrowed and asc_delta are
accessed only while holding asc_lock, it makes no any sense to modify
as_lower_bound and as_upper_bound in multiple steps.  Instead of that
the new code uses only 2 atomics in all the cases, one per as_*_bound
variable.  I think even that is overkill, simple atomic store and
load could be used here, since all modifications are done under the
as_lock, but there are no such primitives in ZFS code now.

While there, make borrow code consider previous borrow value, so that
on mixed request patterns reduce chance of needing to borrow again if
much larger request follows tiny one that needed borrow.

Also reduce as_numbuckets from uint64_t to u_int.  It makes no sense
to use so large division operation on every aggsum_add().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9930
2020-02-06 13:21:06 -08:00
Alexander Motin
cbd8f5b759
Few microoptimizations to dbuf layer
Move db_link into the same cache line as db_blkid and db_level.
It allows significantly reduce avl_add() time in dbuf_create() on
systems with large RAM and huge number of dbufs per dnode.

Avoid few accesses to dbuf_caches[].size, which is highly congested
under high IOPS and never stays in cache for a long time.  Use local
value we are receiving from zfs_refcount_add_many() any way.

Remove cache_size_bytes_max bump from dbuf_evict_one().  I don't see
a point to do it on dbuf eviction after we done it on insertion in
dbuf_rele_and_unlock().

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9931
2020-02-05 11:08:44 -08:00
Matthew Macy
cccbed9f98
Convert dbuf dirty record record list to a list_t
Additionally pull in state machine comments about
upcoming async cow work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9902
2020-02-05 11:07:19 -08:00
Ryan Moeller
8c4987c489
Restore aclmode and remove acltype on FreeBSD
This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE,
matching what is done in the NFSv4 ACLs PR (#9709).

On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide
ZFS_PROP_ACLMODE.

The tests already assume this arrangement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9913
2020-02-04 08:40:07 -08:00
Matthew Ahrens
ec21397127
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 09:33:14 -08:00
Romain Dolbeau
35b07497c6 Add AltiVec RAID-Z
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.

Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes #9539
2020-01-23 11:01:24 -08:00
Jason King
e2ef1cbf04 Support inheriting properties in channel programs
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9738
2020-01-22 17:03:17 -08:00
Tom Caputi
61152d1069 Fix errata #4 handling for resuming streams
Currently, the handling for errata #4 has two issues which allow
the checks for this issue to be bypassed using resumable sends.
The first issue is that drc->drc_fromsnapobj is not set in the
resuming code as it is in the non-resuming code. This causes
dsl_crypto_recv_key_check() to skip its checks for the
from_ivset_guid. The second issue is that resumable sends do not
clean up their on-disk state if they fail the checks in
dmu_recv_stream() that happen before any data is received.

As a result of these two bugs, a user can attempt a resumable send
of a dataset without a from_ivset_guid. This will fail the initial
dmu_recv_stream() checks, leaving a valid resume state. The send
can then be resumed, which skips those checks, allowing the receive
to be completed.

This commit fixes these issues by setting drc->drc_fromsnapobj in
the resuming receive path and by ensuring that resumablereceives
are properly cleaned up if they fail the initial dmu_recv_stream()
checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9818 
Closes #9829
2020-01-14 12:25:20 -08:00
Brian Behlendorf
e458fcca75
Change http://zfsonlinux.org links to https://zfsonlinux.org
Update the project website links contained in to repository to
reference the secure https://zfsonlinux.org address.

Reviewed-By: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9837
2020-01-13 16:43:59 -08:00
Tom Caputi
ba0ba69e50 Add 'zfs send --saved' flag
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9007
2020-01-10 10:16:58 -08:00
Ryan Moeller
957c7aa23c Relocate common quota functions to shared code
The quota functions are common to all implementations and can be
moved to common code.  As a simplification they were moved to the
Linux platform code in the initial refactoring.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9710
2019-12-11 12:12:08 -08:00
Matthew Macy
657ce25357 Eliminate Linux specific inode usage from common code
Change many of the znops routines to take a znode rather
than an inode so that zfs_replay code can be largely shared
and in the future the much of the znops code may be shared.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9708
2019-12-11 11:53:57 -08:00
Matthew Macy
362ae8d11f Abstract away platform specific superblock references
The zfsvfs->z_sb field is Linux specified and should be abstracted.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9697
2019-12-10 09:21:07 -08:00
Matthew Macy
3c502d3b75 Exclude data from cores unconditionally and metadata conditionally
This change allows us to align the code dump logic across platforms.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9691
2019-12-09 12:29:56 -08:00
Matthew Macy
f95704ca5e Disable EDONR on FreeBSD
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9664
2019-12-05 13:10:29 -08:00
Matthew Macy
054a049841 Add ZFS_IOC offsets for FreeBSD
FreeBSD requires three additional ioctls, they are ZFS_IOC_NEXTBOOT,
ZFS_IOC_JAIL, and ZFS_IOC_UNJAIL.  These have been added after the
Linux-specific ioctls.  The range 0x80-0xFF has been reserved for 
future optional platform-specific ioctls.  Any platform may choose
to implement these as appropriate.

None of the existing ioctl numbers have been changed to maintain
compatibility.  For Linux no vectors have been registered for the
new ioctls and they are reported as unsupported.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9667
2019-12-05 13:06:51 -08:00
Matthew Macy
e64e84eca5 Refactor deadman set failmode to be cross platform
Update zfs_deadman_failmode to use the ZFS_MODULE_PARAM_CALL
wrapper, and split the common and platform specific portions.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9670
2019-12-05 12:40:45 -08:00
Matthew Macy
be627fc847 Refactor zfs_context.h to build on FreeBSD
- on Linux move Linux specific headers to zfs_context_os.h
- on FreeBSD move FreeBSD specific definitions to zfs_context_os.h
- remove duplicate tsd_ definitions
- remove unused AT_TYPE

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9668
2019-12-04 13:12:57 -08:00
Matthew Macy
5142032106 Move zfs_cmd_t copyin/copyout to platform code
FreeBSD needs to cope with multiple version of the zfs_cmd_t
structure. Allowing the platform code to pre and post
process the cmd structure makes it possible to work with
legacy tooling.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9624
2019-12-02 10:08:27 -08:00
Matthew Macy
42a826eed3 Add FreeBSD required defines to mntent.h
Linux and FreeBSD use different names for suid / setuid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9632
2019-11-30 15:49:09 -08:00
Matthew Macy
77323bcf53 Add FreeBSD support to zio_crypto.h
Minimal compatibility changes for FreeBSD.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9631
2019-11-30 15:38:16 -08:00
Matthew Macy
a5b762ab1d Resolve ZoF differences in zfs_ioctl.h
FreeBSD needs to be able to pass the jail id to the jail/unjail ioctls
and the struct file in the device structure is unused.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9625
2019-11-30 15:35:54 -08:00
Matthew Macy
d6f67df63c Minor diff reduction with ZoF in include/sys
- move linux/ includes to platform headers
- add void * io_bio to zio for tracking the underlying bio
- add freebsd specific fields to abd_scatter

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9615
2019-11-27 11:11:03 -08:00
jwpoduska
3c819a2c7d Prevent unnecessary resilver restarts
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840 
Closes #9155
Closes #9378
Closes #9551
Closes #9588
2019-11-27 10:15:01 -08:00
Matthew Macy
da92d5cbb3 Add zfs_file_* interface, remove vnodes
Provide a common zfs_file_* interface which can be implemented on all 
platforms to perform normal file access from either the kernel module
or the libzpool library.

This allows all non-portable vnode_t usage in the common code to be 
replaced by the new portable zfs_file_t.  The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL.  Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9556
2019-11-21 09:32:57 -08:00
Michael Niewöhner
6d948c3519 Add kmem_cache flag for forcing kvmalloc
This adds a new KMC_KVMEM flag was added to enforce use of the
kvmalloc allocator in kmem_cache_create even for large blocks, which
may also increase performance in some specific cases (e.g. zstd), too.

Default to KVMEM instead of VMEM in spl_kmem_cache_create.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9034
2019-11-13 10:05:23 -08:00
Brian Behlendorf
066e825221
Linux compat: Minimum kernel version 3.10
Increase the minimum supported kernel version from 2.6.32 to 3.10.
This removes support for the following Linux enterprise distributions.

    Distribution     | Kernel | End of Life
    ---------------- | ------ | -------------
    Ubuntu 12.04 LTS | 3.2    | Apr 28, 2017
    SLES 11          | 3.0    | Mar 32, 2019
    RHEL / CentOS 6  | 2.6.32 | Nov 30, 2020

The following changes were made as part of removing support.

* Updated `configure` to enforce a minimum kernel version as
  specified in the META file (Linux-Minimum: 3.10).

    configure: error:
        *** Cannot build against kernel version 2.6.32.
        *** The minimum supported kernel version is 3.10.

* Removed all `configure` kABI checks and matching C code for
  interfaces which solely predate the Linux 3.10 kernel.

* Updated all `configure` kABI checks to fail when an interface is
  missing which was in the 3.10 kernel up to the latest 5.1 kernel.
  Removed the HAVE_* preprocessor defines for these checks and
  updated the code to unconditionally use the verified interface.

* Inverted the detection logic in several kABI checks to match
  the new interface as it appears in 3.10 and newer and not the
  legacy interface.

* Consolidated the following checks in to individual files. Due
  the large number of changes in the checks it made sense to handle
  this now.  It would be desirable to group other related checks in
  the same fashion, but this as left as future work.

  - config/kernel-blkdev.m4 - Block device kABI checks
  - config/kernel-blk-queue.m4 - Block queue kABI checks
  - config/kernel-bio.m4 - Bio interface kABI checks

* Removed the kABI checks for sops->nr_cached_objects() and
  sops->free_cached_objects().  These interfaces are currently unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9566
2019-11-12 08:59:06 -08:00
Matthew Macy
27ece2ee4d Move platform specific parts of zfs_znode.h to platform code
Some of the znode fields are different and functions
consuming an inode don't exist on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9536
2019-11-06 10:54:25 -08:00
Prakash Surya
e5d1c27e30 Enable use of DTRACE_PROBE* macros in "spl" module
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.

Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.

Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.

Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9525
2019-11-01 13:13:43 -07:00
Matthew Macy
4a2ed90013 Wrap Linux module macros
MODULE_VERSION is already defined on FreeBSD. Wrap all of the
used MODULE_* macros for the sake of consistency and portability.

Add a user space noop version to reduce the need for _KERNEL ifdefs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9542
2019-11-01 10:41:03 -07:00
Matthew Macy
bd4dde8ef7 Prefix struct rangelock
A struct rangelock already exists on FreeBSD.  Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9534
2019-11-01 10:37:33 -07:00
Matthew Macy
156f74fc03 Return an error code from zfs_acl_chmod_setattr
The FreeBSD implementation can fail, allow this function to
fail and add the required error handling for Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9541
2019-11-01 10:19:11 -07:00
Matthew Macy
c4ae27c763 Move sha2.h to platform code
FreeBSD has its own sha routines that the port uses.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9530
2019-10-31 15:45:58 -07:00
Matthew Macy
a5308e252d Don't cast away const
Follow up to 511fce6b which missed a cast.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9533
2019-10-31 10:38:03 -07:00
Matthew Macy
2a3aa5a109 Factor Linux specific code out of spa_misc.c
Move these Linux module parameter get/set helpers in to
platform specific code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9457
2019-10-31 09:52:22 -07:00
Matthew Macy
4a22ba5be0 Minor spa portability fixes
- FreeBSD's rootpool import code uses spa_config_parse
- Move the zvol_create_minors call out from under the
  spa_namespace_lock in spa_import. It isn't needed and it causes
  a lock order reversal on FreeBSD.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9499
2019-10-28 09:51:53 -07:00
loli10K
e35704647e Fix for ARC sysctls ignored at runtime
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.

A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9487  
Closes #9489
2019-10-26 15:22:19 -07:00
Matthew Macy
1952fe0e25 Move platform dependent errno aliases
EBADE, EBADR, and ENOANO do not exist on FreeBSD

The libspl errno.h is similarly platform dependent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9498
2019-10-25 13:40:50 -07:00
Matthew Macy
68a1b1589a Remove sdt.h
It's mostly a noop on ZoL and it conflicts with platforms that 
support dtrace.  Remove this header to resolve the conflict.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9497
2019-10-25 13:38:37 -07:00
Brian Behlendorf
10fa254539
Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346
Closes #9403
2019-10-24 10:17:33 -07:00
Matthew Macy
c9c9c1e213 OpenZFS restructuring - ARC memory pressure
Factor Linux specific memory pressure handling out of ARC.  Each
platform will have different available interfaces for managing memory
pressure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9472
2019-10-18 13:23:19 -04:00
Matthew Macy
08f530c699 Make zfsdev_getminor signature cross platform
Only pass the file descriptor to make zfsdev_get_miror() portable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9466
2019-10-16 18:43:52 -07:00
Paul Dagnelie
511fce6b1f Don't call sizeof on void
We get the sizeof the appropriate type, and don't cast away const.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9455
2019-10-13 19:21:51 -07:00
Paul Dagnelie
516a83f886 Function name and comment updates
Rename certain functions for more consistency when they share common 
features. Make comments clearer about what arguments should be passed 
to the insert and add functions.

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9441
2019-10-11 10:13:21 -07:00
Matthew Macy
af1698f59b Expose dmu_buf_hold_array_by_dnode to platform code
FreeBSD uses this in its pager ops routines

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9431
2019-10-11 10:06:18 -07:00
Matthew Macy
eedb3a62b9 Make zil_async_to_sync visible to platform code
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9440
2019-10-10 15:39:44 -07:00
Matthew Macy
e4f5fa1229 Fix strdup conflict on other platforms
In the FreeBSD kernel the strdup signature is:

```
char	*strdup(const char *__restrict, struct malloc_type *);
```

It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9433
2019-10-10 09:47:06 -07:00
Paul Dagnelie
ca5777793e Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to 
store range trees more efficiently.

The new structure is approximately a B-tree, though there are some 
small differences from the usual characterizations. The tree has core 
nodes and leaf nodes; each contain data elements, which the elements 
in the core nodes acting as separators between its children. The 
difference between core and leaf nodes is that the core nodes have an 
array of children, while leaf nodes don't. Every node in the tree may 
be only partially full; in most cases, they are all at least 50% full 
(in terms of element count) except for the root node, which can be 
less full. Underfull nodes will steal from their neighbors or merge to 
remain full enough, while overfull nodes will split in two. The data 
elements are contained in tree-controlled buffers; they are copied 
into these on insertion, and overwritten on deletion. This means that 
the elements are not independently allocated, which reduces overhead, 
but also means they can't be shared between trees (and also that 
pointers to them are only valid until a side-effectful tree operation 
occurs). The overhead varies based on how dense the tree is, but is 
usually on the order of about 50% of the element size; the per-node 
overheads are very small, and so don't make a significant difference. 
The trees can accept arbitrary records; they accept a size and a 
comparator to allow them to be used for a variety of purposes.

The new trees replace the AVL trees used in the range trees today. 
Currently, the range_seg_t structure contains three 8 byte integers 
of payload and two 24 byte avl_tree_node_ts to handle its storage in 
both an offset-sorted tree and a size-sorted tree (total size: 64 
bytes). In the new model, the range seg structures are usually two 4 
byte integers, but a separate one needs to exist for the size-sorted 
and offset-sorted tree. Between the raw size, the 50% overhead, and 
the double storage, the new btrees are expected to use 8*1.5*2 = 24 
bytes per record, or 33.3% as much memory as the AVL trees (this is 
for the purposes of storing metaslab range trees; for other purposes, 
like scrubs, they use ~50% as much memory).

We reduced the size of the payload in the range segments by teaching 
range trees about starting offsets and shifts; since metaslabs have a 
fixed starting offset, and they all operate in terms of disk sectors, 
we can store the ranges using 4-byte integers as long as the size of 
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle 
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not 
anticipate disks of this size in the near future, there should be 
almost no cases where metaslabs need 64-byte integers to store their 
ranges. We do still have the capability to store 64-byte integer ranges 
to account for cases where we are storing per-vdev (or per-dnode) trees, 
which could reasonably go above the limits discussed. We also do not 
store fill information in the compact version of the node, since it 
is only used for sorted scrub.

We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than 
they are for the avl tree, remove usually requires a find operation, 
while in the AVL tree model the element itself suffices. Some clever 
changes actually caused an overall speedup in metaslab loading; we use 
approximately 40% less cpu to load metaslabs in our tests on Illumos.

Another memory and performance optimization was achieved by changing 
what is stored in the size-sorted trees. When a disk is heavily 
fragmented, the df algorithm used by default in ZFS will almost always 
find a number of small regions in its initial cursor-based search; it 
will usually only fall back to the size-sorted tree to find larger 
regions. If we increase the size of the cursor-based search slightly, 
and don't store segments that are smaller than a tunable size floor 
in the size-sorted tree, we can further cut memory usage down to 
below 20% of what the AVL trees store. This also results in further 
reductions in CPU time spent loading metaslabs.

The 16KiB size floor was chosen because it results in substantial memory 
usage reduction while not usually resulting in situations where we can't 
find an appropriate chunk with the cursor and are forced to use an 
oversized chunk from the size-sorted tree. In addition, even if we do 
have to use an oversized chunk from the size-sorted tree, the chunk 
would be too small to use for ZIL allocations, so it isn't as big of a 
loss as it might otherwise be. And often, more small allocations will 
follow the initial one, and the cursor search will now find the 
remainder of the chunk we didn't use all of and use it for subsequent 
allocations. Practical testing has shown little or no change in 
fragmentation as a result of this change.

If the size-sorted tree becomes empty while the offset sorted one still 
has entries, it will load all the entries from the offset sorted tree 
and disregard the size floor until it is unloaded again. This operation 
occurs rarely with the default setting, only on incredibly thoroughly 
fragmented pools.

There are some other small changes to zdb to teach it to handle btrees, 
but nothing major.
                                           
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 10:36:03 -07:00
Matthew Macy
2cc479d049 Rename rangelock_ functions to zfs_rangelock_
A rangelock KPI already exists on FreeBSD.  Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9402
2019-10-03 15:54:29 -07:00
Matthew Macy
6360e2779e Add inode accessors to common code
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9389
2019-10-02 09:15:12 -07:00
Matthew Macy
13a4027a7c OpenZFS restructuring - arc_stats
Make arc_stats visible to platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9386
2019-10-01 16:35:05 -07:00
Matthew Macy
7bb0c29468 OpenZFS restructuring - zfs_ioctl
Refactor the zfs ioctls in to platform dependent and independent bits.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9301
2019-09-27 10:46:28 -07:00
Matthew Macy
5df7e9d85c OpenZFS restructuring - zvol
Refactor the zvol in to platform dependent and independent bits.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9295
2019-09-25 09:20:30 -07:00
John Gallagher
e60e158eff Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.

This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:

 - Scrubs or resilvers to complete
 - Devices to initialized
 - Devices to be replaced
 - Devices to be removed
 - Checkpoints to be discarded
 - Background freeing to complete

For example, a scrub that is in progress could be waited for by running

    zpool wait -t scrub <pool>

This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.

This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.

Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:

 - Added ZoL-style ioctl input declaration.
 - Reorganized error handling in zpool_initialize in libzfs to integrate
   better with changes made for TRIM support.
 - Fixed check for whether a checkpoint discard is in progress.
   Previously it also waited if the pool had a checkpoint, instead of
   just if a checkpoint was being discarded.
 - Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
 - Updated more existing tests to make use of new 'zpool wait'
   functionality, tests that don't exist in Delphix OS.
 - Used existing ZoL tunable zfs_scan_suspend_progress, together with
   zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
 - Added support for a non-integral interval argument to zpool wait.

Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-13 18:09:06 -07:00
Chengfei ZHu
7238cbd4d3 QAT related bug fixes
1. Fix issue:  Kernel BUG with QAT during decompression  #9276.
   Now it is uninterruptible for a specific given QAT request,
   but Ctrl-C interrupt still works in user-space process.

2. Copy the digest result to the buffer only when doing encryption,
   and vise-versa for decryption.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Closes #9276 
Closes #9303
2019-09-12 13:33:44 -07:00
Matthew Macy
74756182d2 Enable compiler to typecheck logging
Annotate spa logging declarations with printflike
Workaround gcc bug (non disable-able warning) by
replacing "" with " "

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9316
2019-09-12 13:28:26 -07:00
Matthew Macy
d66620681d OpenZFS restructuring - move linux tracing code to platform directories
Move Linux specific tracing headers and source to platform directories
and update the build system.

Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9290
2019-09-11 14:25:53 -07:00
Brian Behlendorf
25f06d677a
Fix /etc/hostid on root pool deadlock
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify().  A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool.  This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer.  For example, to perform a
`zpool attach` as shown in the abbreviated stack below.

To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property.  The cached
value is then relied upon for subsequent accesses.

Call Trace:
    spa_config_enter+0x1e8/0x350 [zfs]
    zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
    zio_read+0x6c/0x140 [zfs]
    ...
    vfs_read+0xfc/0x1e0
    kernel_read+0x50/0x90
    ...
    spa_get_hostid+0x1c/0x38 [zfs]
    spa_config_generate+0x1a0/0x610 [zfs]
    vdev_label_init+0xa0/0xc80 [zfs]
    vdev_create+0x98/0xe0 [zfs]
    spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9256 
Closes #9285
2019-09-10 13:42:30 -07:00
Brian Behlendorf
b88ca2acf5
Enable SIMD for encryption
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context.  This was done
intentionally since it's a relatively infrequent operation.  However,
this also meant that the encryption context templates were initialized
using the generic operations.  Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.

Resolve the issue by initializing the context templates in an I/O
pipeline thread.  And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context.  For example, when performing a read from the ARC.

Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9215
Closes #9296
2019-09-10 10:45:46 -07:00
Matthew Macy
bced7e3aaa OpenZFS restructuring - move platform specific sources
Move platform specific Linux source under module/os/linux/
and update the build system accordingly.  Additional code
restructuring will follow to make the common code fully
portable.
    
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Closes #9206
2019-09-06 11:26:26 -07:00
Matthew Macy
03fdcb9adc Make module tunables cross platform
Adds ZFS_MODULE_PARAM to abstract module parameter
setting to operating systems other than Linux.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9230
2019-09-05 14:49:49 -07:00
Matthew Macy
006e9a4088 OpenZFS restructuring - move platform specific headers
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.

As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context.  No functional change, the source has been
refactored and the relevant #include's updated.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9198
2019-09-05 09:34:54 -07:00
Andrea Gelmini
cf7c5a030e Fix typos in include/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9238
2019-08-30 09:53:15 -07:00
Paul Dagnelie
eef0f4d84e Keep more metaslabs loaded
With the other metaslab changes loaded onto a system, we can 
significantly reduce the memory usage of each loaded metaslab and 
unload them on demand if there is memory pressure. However, none 
of those changes actually result in us keeping more metaslabs loaded. 
If we don't keep more metaslabs loaded, we will still have to wait 
for demand-loading to finish when no loaded metaslab can satisfy our 
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while 
unloading is currently done on a txg-count basis. Txgs can take a 
widely varying range of times, from tenths of a second to several 
seconds. This can result in confusing, hard to predict behavior.

This change simply adds a time-based component to metaslab unloading. 
A metaslab will remain loaded for one minute and 8 txgs (by default) 
after it was last used, unless it is evicted due to memory pressure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes #9197
2019-08-29 10:20:36 -07:00
Chunwei Chen
035e96118b Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
2019-08-28 10:42:02 -07:00
Tom Caputi
e7a2fa70c3 Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
2019-08-27 09:55:51 -07:00
Matthew Ahrens
325d288c5d Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing.  With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.

By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes #8744
2019-08-20 11:34:52 -07:00
Paul Dagnelie
f09fda5071 Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
2019-08-16 09:08:21 -06:00
Paul Dagnelie
dc04a8c757 Prevent race in blkptr_verify against device removal
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run 
into a race condition where that vdev is temporarily unavailable. This 
happens when a device removal operation and the old vdev_t has been 
removed from the array, but the new indirect vdev has not yet been 
inserted.

We hold the spa_config_lock while doing our sensitive verification. 
To ensure that we don't deadlock, we only grab the lock if we don't 
have config_writer held. In addition, I had to const the tags of the 
refcounts and the spa_config_lock arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9112
2019-08-13 21:24:43 -06:00
Chunwei Chen
8e556c5ebc Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
2019-08-13 21:21:27 -06:00
George Wilson
c8242a96ba spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity 
of the data and metadata. The amount it scans will depend on the 
import flags provided. On systems with small amounts of memory or 
when importing a pool from the crash kernel, it's possible for 
spa_load_verify to issue too many I/Os that it consumes all the memory 
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
2019-08-13 08:11:57 -06:00
Tomohiro Kusumi
a43570c5f3 Change boolean-like uint8_t fields in znode_t to boolean_t
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9092
2019-08-13 07:58:02 -06:00
Paul Dagnelie
c81f1790e2 Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-05 14:34:27 -07:00
Sara Hartse
37f03da8ba Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
2019-07-26 10:54:14 -07:00
Matthew Ahrens
1ff46825e2 Replace zf_rwlock with a mutex
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.  Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9062
2019-07-25 11:57:58 -07:00
Serapheim Dimitropoulos
43a8536260 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015 
Closes #9044
2019-07-18 13:02:33 -07:00
jdike
a649768a17 Fix lockdep recursive locking false positive in dbuf_destroy
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #8984
2019-07-17 09:18:24 -07:00
Serapheim Dimitropoulos
93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
Tomohiro Kusumi
ff9630d1a8 Disable unused pathname::pn_path* (unneeded in Linux)
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.

Saves stack a bit on lookup and unlink.

(#if0'd members instead of removing since comments refer to them.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9025
2019-07-15 13:57:56 -07:00
Brian Behlendorf
e5db313494
Linux 5.0 compat: SIMD compatibility
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels.  This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.

For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used.  All non-kernel
threads will restore the preserved user FPU state.

For 5.2 and latter kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers.  Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.

In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS.  The vast majority of SIMD operations are
already performed by the IO pipeline.  The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact.  The two most noteworthy cases are:

  1) Decrypting the wrapping key for an encrypted dataset,
     i.e. `zfs load-key`.  All other encryption and decryption
     operations will use the SIMD optimized implementations.

  2) Generating the payload checksums for a `zfs send` stream.

In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take in to
consideration the calling context.  This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).

The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time.  This code
was moved in to a taskq in order to accommodate the new kernel
thread restriction.

Finally, a few other modifications were made in order to further
harden this code and facilitate testing.  They include updating
each implementations operations structure to be declared as a
constant.  And allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8754 
Closes #8793 
Closes #8965
2019-07-12 09:31:20 -07:00
Paul Dagnelie
f664f1ee7f Decrease contention on dn_struct_rwlock
Currently, sequential async write workloads spend a lot of time 
contending on the dn_struct_rwlock. This lock is responsible for 
protecting the entire block tree below it; this naturally results 
in some serialization during heavy write workloads. This can be 
resolved by having per-dbuf locking, which will allow multiple 
writers in the same object at the same time.

We introduce a new rwlock, the db_rwlock. This lock is responsible 
for protecting the contents of the dbuf that it is a part of; when 
reading a block pointer from a dbuf, you hold the lock as a reader. 
When writing data to a dbuf, you hold it as a writer. This allows 
multiple threads to write to different parts of a file at the same 
time.

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens matt@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-52564
External-issue: DLPX-53085
External-issue: DLPX-57384
Closes #8946
2019-07-08 13:18:50 -07:00
Brad Lewis
cb70964221 8659 static dtrace probes unavailable on non-GPL modules
ZFS tracing efforts are hampered by the inability to access zfs static
probes(probes using DTRACE_PROBE macros). The probes are available via
tracepoints for GPL modules only.  The build could be modified to
generate a function for each unique DTRACE_PROBE invocation. These could
be then accessed via kprobes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>
Closes #8659 
Closes #8663
2019-07-08 11:20:53 -07:00
loli10K
3b5fe2c351 Fix zfs "redact" misc issues
* zfs redact error messages do not end with newline character
 * 30af21b0 inadvertently removed some ZFS_PROP comments
 * man/zfs: zfs redact <redaction_snapshot> is not optional

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8988
2019-07-05 16:38:17 -07:00
Alexander Motin
fc7546777b Avoid extra taskq_dispatch() calls by DMU
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8909
2019-06-25 12:03:38 -07:00
loli10K
746d4a451e Fix bp_embedded_type enum definition
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8951
2019-06-24 18:02:17 -07:00
Matthew Ahrens
59ec30a329 Remove code for zfs remap
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs.  This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.

Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8944
2019-06-24 16:44:01 -07:00
Don Brady
186898bbb5 OpenZFS 9425 - channel programs can be interrupted
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.

Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete.  Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.

Porting notes: Added missing return value from cv_wait_sig()

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926
Closes #8904
2019-06-22 16:51:46 -07:00
Paul Dagnelie
2b09628b59 Fix comments on zfs_bookmark_phys
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8945
2019-06-22 16:32:26 -07:00
Tom Caputi
da68988708 Allow unencrypted children of encrypted datasets
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8737 
Closes #8870
2019-06-20 12:29:51 -07:00
Matthew Ahrens
050d720c43 Remove dedupditto functionality
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x).  The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.

ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ).  Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful.  The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad.  This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.

Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.

Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).

In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection.  It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).

We should remove the dedupditto functionality.  For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect.  For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate.  However, ZFS won't write any new dedupditto
blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270 
Closes #8310
2019-06-19 14:54:02 -07:00
Paul Dagnelie
30af21b025 Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to 
a target system. One possible use case for this feature is to not 
transmit sensitive information to a data warehousing, test/dev, or 
analytics environment. Another is to save space by not replicating 
unimportant data within a given dataset, for example in backup tools 
like zrepl.

Redacted send/receive is a three-stage process. First, a clone (or 
clones) is made of the snapshot to be sent to the target. In this 
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction 
snapshot" (or snapshots). Second, the new zfs redact command is used 
to create a redaction bookmark. The redaction bookmark stores the 
list of blocks in a snapshot that were modified by the redaction 
snapshot(s). Finally, the redaction bookmark is passed as a parameter 
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive 
or unwanted information, and those blocks are not included in the send 
stream.  When sending from the redaction bookmark, the blocks it 
contains are considered as candidate blocks in addition to those 
blocks in the destination snapshot that were modified since the 
creation_txg of the redaction bookmark.  This step is necessary to 
allow the target to rehydrate data in the case where some blocks are 
accidentally or unnecessarily modified in the redaction snapshot.

The changes to bookmarks to enable fast space estimation involve 
adding deadlists to bookmarks. There is also logic to manage the 
life cycles of these deadlists.

The new size estimation process operates in cases where previously 
an accurate estimate could not be provided. In those cases, a send 
is performed where no data blocks are read, reducing the runtime 
significantly and providing a byte-accurate size estimate.

Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 09:48:12 -07:00
Matthew Ahrens
7218b29e4b lz4_decompress_abd declared but not defined
`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-47477
Closes #8894
2019-06-13 13:14:34 -07:00
Matthew Ahrens
53dce5acc6 panic in removal_remap test on 4K devices
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes #8893
2019-06-13 13:12:39 -07:00
Matthew Ahrens
d9b4bf0665 fat zap should prefetch when iterating
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes #8862
2019-06-12 13:13:09 -07:00
Matthew Ahrens
5662fd5794 single-chunk scatter ABDs can be treated as linear
Scatter ABD's are allocated from a number of pages.  In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer.  Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer.  This can have a
measurable performance overhead on some workloads.

https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter.  For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path.  For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.

All single-page (4K) ABD's can be represented this way.  Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages).  This is often the case for 2-page (8K)
ABD's.

Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy.  A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.

Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space.  This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8580
2019-06-11 09:02:31 -07:00
Matthew Ahrens
b8738257c2 make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations.  The large allocations are for ZIL blocks.  If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more.  In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
2019-06-10 11:48:42 -07:00
Paul Dagnelie
893a6d62c1 Allow metaslab to be unloaded even when not freed from
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure, 
on heavily fragmented pools they can still consume a significant 
amount of memory. This problem is amplified when we fail to unload 
metaslabs that we aren't using. Currently, we only unload a metaslab 
during metaslab_sync_done; in order for that function to be called 
on a given metaslab in a given txg, we have to have dirtied that 
metaslab in that txg. If the dirtying was the result of an allocation, 
we wouldn't be unloading it (since it wouldn't be 8 txgs since it 
was selected), so in effect we only unload a metaslab during txgs 
where it's being freed from.

We move the unload logic from sync_done to a new function, and 
call that function on all metaslabs in a given vdev during 
vdev_sync_done().

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8837
2019-06-06 19:10:43 -07:00
Olaf Faaland
ca95f70dff zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time.  The user may
not know why, or be able to tell whether the import is progressing or is
hung.

Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one).  The kstat is /proc/spl/kstat/zfs/import_progress.

The kstat contents are as follows:
pool_guid         load_state multihost_secs  max_txg pool_name
16667015954387398 3          15              0       tank3

load_state: the value of spa_load_state
multihost_secs:  seconds until the end of the multihost activity
                 check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring

This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting.  The
zpool import subcommand could also be modified to report this
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 10:08:05 -07:00
Brian Behlendorf
caf9dd209f
Fix send/recv lost spill block
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object.  It may need to be removed or kept depending on how the
object was modified at the source.

This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties.  This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.

What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose.  When set then
the object references a spill block and it must be kept.  Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record.  Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.

Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible.  Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed.  Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.

The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS.  Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8668
2019-05-07 15:18:44 -07:00
Tomohiro Kusumi
9c53e51616 Fix zfs set atime|relatime=off|on behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.

(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)

--
 # zpool create p1 <device>
 # zfs create p1/f1
 # zfs set atime=off p1
 # echo test > /p1/f1/test
 # sync
 # zfs list
 NAME    USED  AVAIL     REFER  MOUNTPOINT
 p1      176K  18.9G     25.5K  /p1
 p1/f1    26K  18.9G       26K  /p1/f1
 # zfs get atime
 NAME   PROPERTY  VALUE  SOURCE
 p1     atime     off    local
 p1/f1  atime     off    inherited from p1
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:33.741205192 +0900
 # cat /p1/f1/test
 test
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:50.173231861 +0900
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--

The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.

Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.

Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.

The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674 
Closes #8675
2019-05-07 10:06:30 -07:00
Matthew Ahrens
6bdefad311 Remove incorrect (and inappropriate) comment in dprintf_dnode
This comment seems to misunderstand the ## preprocessor token, which
does token concatenation.  It is not needed here, since we are
concatenating string literals, which is performed by putting the
literals next to each other.

Additionally, the comment uses offensive language.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8698
Closes #8699
2019-05-01 17:32:54 -07:00
Richard Laager
8dda07b33c Reference zfeature.c in a SPA_VERSION comment
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:19 -07:00
Richard Laager
7698c4eca9 Remove zfs.h comments about GRUB
Nobody is going to be bumping SPA_VERSION again, as OpenZFS has moved on
to feature flags.  Also, there is no requirement to keep GRUB
up-to-date, nor has that been happening.

The ZPL_VERSION could be bumped, but that would likely be handled in a
similar way, by adding filesystem feature flags.  In any event, we do
not need this comment, and we certainly don't need a reference to the
GRUB 0.97 source code in a Solaris tree.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:14 -07:00
Tomohiro Kusumi
96e51d2773 Sync reserved Illumos ioctl comment with actual number
It's 81 now.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8598
2019-04-14 11:12:07 -07:00
Sara Hartse
a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Brian Behlendorf
1b939560be
Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00
Olaf Faaland
060f0226e6 MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 12:47:57 -07:00
Tom Caputi
ab7615d92c Multiple DVA Scrubbing Fix
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.

This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.

To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8453
2019-03-15 14:14:31 -07:00
Tom Caputi
f00ab3f22c Detect and prevent mixed raw and non-raw sends
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.

This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8308
2019-03-13 11:00:43 -07:00
Tom Caputi
579ce7c5ae Add bookmark v2 on-disk feature
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix that for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
2019-03-13 10:58:39 -07:00
Tom Caputi
369aa501d1 Fix handling of maxblkid for raw sends
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.

For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8168 
Closes #8487
2019-03-13 10:52:01 -07:00
Olaf Faaland
db2af93d72 Increase default zfs_multihost_fail_intervals and import_intervals
By default, when multihost is enabled for a pool, the pool is
suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write.  This is the recommended
configuration.

The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.

There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds.  This
in turn caused MMP to suspend the pool.

In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O.  The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.

Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.

Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor.  This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7709 
Closes #8495
2019-03-13 09:50:48 -07:00
Alek P
4c0883fb4a Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
2019-03-12 13:13:22 -07:00
Olaf Faaland
3d31aad83e MMP writes rotate over leaves
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes.  This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.

The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.

For example, in a pool like this:

       root-vdev
   +------+---------------+
vdev1                   vdev2
  |                       |
  |                +------+-----+-----+----+
disk1             disk2 disk3 disk4 disk5 disk6

vdev1 and vdev2 will each be chosen 50% of the time.  Every time vdev1
is chosen, disk1 will be chosen.  However, every time vdev2 is chosen,
disk2 is chosen 20% of the time.  As a result, disk1 will be sent 5x as
many MMP writes as disk2.

This may create wear issues in the case of SSDs.  It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.

The new code maintains a list of leaf vdevs in the pool.  MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf.  To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached.  It stops when a
suitable leaf is found or all leaves have been examined.

Added a test to verify MMP write distribution is even.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7953
2019-03-12 10:37:06 -07:00
Lorenz Brun
bf90948daf Reorder ZFS ioctls to fix cross-version compatibility
Reorder ZFS ioctls to fix cross-version compatibility.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Lorenz Brun <lorenz@dolansoft.org>
Closes #8484
2019-03-09 13:39:31 -08:00
Paul Zuchowski
a73e8fdb93 Stack overflow in recursive bpobj_iterate_impl
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7674
Closes #7675 
Closes #7908
2019-03-06 09:50:55 -08:00
lidongyang
8d9e51c084 Fix dnode_hold_impl() soft lockup
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.

Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes #8433
2019-02-22 09:48:37 -08:00
Serapheim Dimitropoulos
928e8ad47d Introduce auxiliary metaslab histograms
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram.  The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.

In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8358
2019-02-20 09:59:56 -08:00
Paul Zuchowski
9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Alek P
dcec0a12c8 port async unlinked drain from illumos-nexenta
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8142
2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos
425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00
loli10K
d8d418ff0c ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
2019-02-08 15:44:15 -08:00
Serapheim Dimitropoulos
df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8327
2019-01-25 09:51:24 -08:00
Serapheim Dimitropoulos
b194fab0fb Factor metaslab_load_wait() in metaslab_load()
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.

Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8290
2019-01-18 11:10:32 -08:00
Serapheim Dimitropoulos
1a759200e5 Document guidelines for usage of zfs_dbgmsg
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8299
2019-01-18 10:16:56 -08:00
Tom Caputi
305781da4b Fix error handling incallers of dbuf_hold_level()
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8291
2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos
75058f3303 Remove unused vdev_t fields
The following fields from the vdev_t struct are not used anywhere.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8285
2019-01-17 15:41:12 -08:00
Serapheim Dimitropoulos
61c3391acc Serialize ZTHR operations to eliminate races
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8229
2019-01-13 10:09:46 -08:00
Brian Behlendorf
6955b40138
Provide more flexible object allocation interface
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.

The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.

The zap_create_* functions have been updated to take advantage of
this new functionality.  The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().

The new symbols have been exported for use by Lustre.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8015
2019-01-10 14:37:43 -08:00
George Wilson
c10d37dd9f zfs initialize performance enhancements
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8230
2019-01-07 11:03:08 -08:00
George Wilson
619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
Brad Lewis
3ec34e5527 OpenZFS 9284 - arc_reclaim_thread has 2 jobs
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping.  However, the code is still confusing and error-prone,
because this thread has two responsibilities.  We should instead
separate this into two threads each with their own responsibility:

 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

 2. keep enough free memory in the system, by calling
    `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
    `arc_available_memory()`.

Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by:  Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
2018-12-26 13:22:28 -08:00
Prakash Surya
900d09b285 OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.

Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).

Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.

The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).

For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.

Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.

This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).

Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).

Porting Notes:
* Reconciled conflicts with the fastwrite feature.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-12-07 11:09:42 -08:00
loli10K
d48091de81 zed: detect and offline physically removed devices
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.

We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1537
Closes #7926
2018-11-09 11:17:24 -08:00
Tony Hutter
ad796b8a3b Add zpool status -s (slow I/Os) and -p (parseable)
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.

 	NAME         STATE     READ WRITE CKSUM  SLOW
 	testpool     ONLINE       0     0     0     -
	  mirror-0   ONLINE       0     0     0     -
 	    loop0    ONLINE       0     0     0    20
 	    loop1    ONLINE       0     0     0     0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7756
Closes #6885
2018-11-08 16:47:24 -08:00
Don Brady
e89f1295d4 Add libzutil for libzfs or libzpool consumers
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c).  This
removes the need for utilities to link against both libzpool and
libzfs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8050
2018-11-05 11:22:33 -08:00
Tom Caputi
ab4c009e3d Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:50 -07:00
Matthew Ahrens
d637db98e1 OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename()
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9681
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7
Closes #8041
2018-10-19 12:02:28 -07:00
Tom Caputi
80a91e7469 Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.

This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee 
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
2018-10-18 21:06:18 -07:00
Allan Jude
9f438c5f94 OpenZFS 9862 - fix typo in comment in vdev_impl.h
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tony Hutter <hutter2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9862
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/84927f52
Closes #8036
2018-10-18 15:09:27 -07:00
Matthew Ahrens
49394a7708 Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
2018-10-17 10:31:38 -07:00
Paul Dagnelie
d52d80b700 Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.

This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.

This functionality will be used by the Redacted Send/Receive feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
2018-10-16 11:15:04 -07:00
ilbsmart
779a6c0bf6 deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
   `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
   and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
   in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
   thread #2, so it stuck and can't complete,  then txg "n" will
   not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
2018-10-16 11:11:24 -07:00
Brian Behlendorf
27f80e85c2 Improved error handling for extreme rewinds
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing.  While this is the most common cause, it's not the only
cause.  Attemping to access a damaged ZAP will result in other
errors.

The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import.  Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully.  Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
2018-10-12 11:24:04 -07:00
Matt Ahrens
5d43cc9a59 OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t.  However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.

We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.

Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
  This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
  check to prevent needless warnings when integrating changes
  which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
  needed to defer the cv_destroy() which was called immediately
  after cv_broadcast().  With d2733258 this should be safe but
  if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
  to this change in order to make it easier to apply OpenZFS 9689.
  - d88895a0, which removed the dummy znode from zvol_state
  - e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function.  Preserved the
  rangelock free list.  The cv_destroy() function will block waiting
  for all processes in cv_wait() to be scheduled and drop their
  reference.  This is done to ensure it's safe to free the condition
  variable.  However, blocking while holding the rl->rl_lock mutex
  can result in a deadlock on Linux.  A free list is introduced to
  defer the cv_destroy() and kmem_free() until after the mutex is
  released.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
2018-10-11 10:19:33 -07:00
Paul Dagnelie
0391690583 Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
2018-10-09 14:05:13 -07:00
Brian Behlendorf
d7e4b30a67 Add zfs_refcount_transfer_ownership_many()
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred.  Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;

This issue may occur when using encryption with refcount debugging
enabled.  An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder.  When unsharing the buffer the
p_abd should be transferred.

This issue does not impact production builds because refcount holders
are not tracked.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
2018-10-09 10:05:48 -07:00
Matthew Ahrens
dfbe267503 OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
  changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
2018-10-04 13:13:28 -07:00
Paul Dagnelie
95542372e6 Add new fnvlist_lookup_* functions
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7977
2018-10-03 15:30:55 -07:00
Tom Caputi
52ce99dd61 Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.

In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
2018-10-03 09:47:11 -07:00
Tim Schumacher
424fd7c3e0 Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.

To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
2018-10-01 10:42:05 -07:00
Matthew Ahrens
fc23d59fa0 Remove duplicate macro in dsl_dir.h
The DD_FIELD_LAST_REMAP_TXG macro was added twice (with the same value).
This change removes one of them.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7968
2018-10-01 10:40:11 -07:00
John Gallagher
d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Tim Schumacher
c13060e478 Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.

Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.

Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885 
Closes #7932
2018-09-26 10:29:26 -07:00
Don Brady
73a5ec30bf Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module).  When not built as a module add
the sysfs entries under /sys/fs/zfs/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868 
Closes #7872
2018-09-06 21:44:52 -07:00
Don Brady
cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Don Brady
b83a0e2dc1 Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments 
for the ioc commands that rely on nvpair input arguments (i.e. non 
legacy commands from libzfs_core). Automatically verify the expected 
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on 
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE, 
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
2018-09-02 12:14:01 -07:00
Don Brady
e8bcb693d6 Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature 
and property attributes. The primary consumer of this information 
is user processes, like the zfs CLI, that need to know what the 
current loaded ZFS module supports. The libzfs binary will consult 
this information when instantiating the zfs and zpool property 
tables and the pool features table.

This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
  features.runtime
  features.pool
  properties.dataset
  properties.pool

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
2018-09-02 12:09:53 -07:00
Tom Caputi
47ab01a18f Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-27 10:16:28 -07:00
Tom Caputi
8c4fb36a24 Small rework of txg_list code
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7795
2018-08-27 10:16:01 -07:00
Serapheim Dimitropoulos
a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
Tom Caputi
1fff937a4c Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:49:19 -07:00
Tom Caputi
d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inheritted.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive heirarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Matthew Ahrens
62840030a7 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
2018-08-02 15:51:45 -07:00
John Gallagher
499b5497cb Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866 
Closes #7761
2018-08-02 14:03:56 -07:00
Nathan Lewis
010d12474c Add support for selecting encryption backend
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
  that control the crypto implementation.  At the moment there is a
  choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
  15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
  with as key schedules generated with various implementations
  are not necessarily interchangable.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes #7102 
Closes #7103
2018-08-02 11:59:24 -07:00
George Wilson
3d503a76e8 Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.

This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
2018-08-02 10:21:48 -07:00
Paul Dagnelie
492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Don Brady
dae3e9ea21 OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
  zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
  were intentially not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
2018-07-30 11:30:41 -07:00
Serapheim Dimitropoulos
6b64382b17 OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation

While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.

An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists.  Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).

= Performance Analysis

To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        5.3
user        0.4
sys         2.3

[Trial 5]
real        8.2
user        0.4
sys         2.4

[Trial 6]
real        6.0
user        0.5
sys         2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        4.9
user        0.0
sys         0.9

[Trial 5]
real        3.8
user        0.0
sys         0.9

[Trial 6]
real        3.6
user        0.0
sys         0.9
```

== Analysis

The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.

From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts.  Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().

From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
  zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
2018-07-30 11:30:03 -07:00
Matthew Ahrens
3a549dc7a1 OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
  and dmu_object_alloc_dnsize() functions to use a common shared
  dmu_object_alloc_impl() function.

OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
Closes #7712
2018-07-25 14:11:35 -07:00
Feng Sun
750e1f88d3 Introduce kstat dmu_tx_dirty_frees_delay
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
2018-07-25 09:52:27 -07:00
Brian Behlendorf
d441e85dd7
Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.

Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on a vdev expansion will be
attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand,  the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of non-
  partitioned block device (loopback), a partition block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provided for better test coverage, and by removing
  the layering on ZFS volumes there issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  error are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This is as good as a unique physical path, while the
  volumes are not used in the test cases anymore for other reasons
  this improvement was included.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
2018-07-23 15:40:15 -07:00
Matthew Ahrens
2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:

The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf.  Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Brian Behlendorf
ac09630d8b
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Serapheim Dimitropoulos
4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Chunwei Chen
edf60b8645 Enforce PROP_ONETIME on zpool properties
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #7661
2018-06-28 14:49:17 -07:00
Serapheim Dimitropoulos
d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defauls to 4KB. This is ok for metaslab
  space maps in terms of bandwirdth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_chedkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Brian Behlendorf
6413c95fbd
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
Tom Caputi
cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
John Gallagher
917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Allan Jude
29445fe3a0 Reserve DMU_BACKUP_FEATURE for ZSTD
Reserve bit 25 for the ZSTD compression feature from FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #7626
2018-06-14 09:47:26 -07:00
Matthew Ahrens
1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily):

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Tom Caputi
e7504d7a18 Raw receive functions must not decrypt data
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7564 
Closes #7584 
Closes #7592
2018-06-06 10:16:41 -07:00
Paul Dagnelie
37fb3e4318 OpenZFS 8484 - Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752
Closes #7462
2018-06-06 09:35:59 -07:00
Tony Hutter
f0ed6c7448 Add pool state /proc entry, "SUSPENDED" pools
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7331 
Closes #7563
2018-06-06 09:33:54 -07:00
Brian Behlendorf
2d9142c9d4
Remove rwlock wrappers
The only remaining consumer of the rwlock compatibility wrappers
is ztest.  Remove the wrappers and convert the few remaining
calls to the underlying pthread functions.

    rwlock_init()    -> pthread_rwlock_init()
    rwlock_destroy() -> pthread_rwlock_destroy()
    rw_rdlock()      -> pthread_rwlock_rdlock()
    rw_wrlock()      -> pthread_rwlock_wrlock()
    rw_unlock()      -> pthread_rwlock_unlock()

Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE
which is equivilant to the USYNC_THREAD behavior.  There is no
functional change.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7591
2018-06-04 16:52:10 -07:00
Serapheim Dimitropoulos
e48afbc4eb OpenZFS 9464 - txg_kick() fails to see that we are quiescing
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without leaving them accumulate changes

Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:

    0ms   311MB in  4114ms (95% p1)  75MB/s  544MB (76%)  336us   153ms     0ms
    0ms     8MB in    51ms ( 0% p1) 163MB/s  474MB (66%)  129us    34ms     0ms
    0ms   366MB in  4454ms (93% p1)  82MB/s  572MB (79%)  498us    20ms     0ms
    0ms   406MB in  5212ms (95% p1)  77MB/s  591MB (82%)  661us    37ms     0ms
    0ms   340MB in  5110ms (94% p1)  66MB/s  622MB (86%) 1048us    41ms     1ms
    0ms     3MB in    61ms ( 0% p1)  51MB/s  419MB (58%)   33us     0ms     0ms
    0ms   361MB in  3555ms (88% p1) 101MB/s  542MB (75%)  335us    40ms     0ms
    0ms   356MB in  4592ms (92% p1)  77MB/s  561MB (78%)  430us    89ms     1ms
    0ms    11MB in   129ms (13% p1)  90MB/s  507MB (70%)  222us    15ms     0ms
    0ms   281MB in  2520ms (89% p1) 111MB/s  542MB (75%)  334us    42ms     0ms
    0ms   383MB in  3666ms (91% p1) 104MB/s  557MB (77%)  411us   133ms     0ms
    0ms   404MB in  5757ms (94% p1)  70MB/s  635MB (88%) 1274us   123ms     2ms
    4ms   367MB in  4172ms (89% p1)  88MB/s  556MB (77%)  401us    51ms     0ms
    0ms    42MB in   470ms (44% p1)  90MB/s  557MB (77%)  412us    43ms     0ms
    0ms   261MB in  2273ms (88% p1) 114MB/s  556MB (77%)  407us    27ms     0ms
    0ms   394MB in  3646ms (85% p1) 108MB/s  552MB (77%)  393us   304ms     0ms
    0ms   275MB in  2416ms (89% p1) 113MB/s  510MB (71%)  200us    53ms     0ms
    0ms     9MB in    53ms ( 0% p1) 169MB/s  483MB (67%)  140us   100ms     1ms

The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:

        if (tx->tx_syncing_txg == 0 &&
        tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
        tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
        tx->tx_quiesced_txg <= tx->tx_synced_txg) {
        tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
        cv_broadcast(&tx->tx_quiesce_more_cv);
    }

The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, ..etc) so the
following scenario can happen:

1] We have an open txg A that had enough dirty data (more than
   zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
   a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
   (and because of A's dirty data) and saw that we are not currently syncing
   any txg and no one has requested quiescing so it requests one by bumping
   tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
   request has been sent to it so it immediately grabs B without letting it
   gather enough data, putting it in a quiesced state and opening a new txg C.

In this scenario txg B, is an example of how the entries of interest show up in
the txg.d output.

Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b
Closes #7587
2018-06-04 14:56:06 -07:00
Pavel Zakharov
8a393be353 OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44
Closes #7532
2018-06-04 14:54:20 -07:00
Sara Hartse
74d42600d8 zpool reopen should detect expanded devices
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes #7546 
Issue #7582
2018-05-31 10:36:37 -07:00
Brian Behlendorf
93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Matthew Ahrens
0dc2f70c5c OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

* Add the following as module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes

* Document the following module parameters:
   * zfs_condense_indirect_vdevs_enable
   * zfs_condense_max_obsolete_bytes
   * zfs_condense_min_mapping_bytes

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
2018-05-24 10:18:07 -07:00
Pavel Zakharov
6cb8e5306d OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the paramters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation take lonnger in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2018-05-08 21:35:27 -07:00
Pavel Zakharov
4a0ee12af8 OpenZFS 8961 - SPA load/import should tell us why it failed
Problem
=======

When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".

While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support.  For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.

Solution
========

Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0
Closes #7459
2018-05-08 21:30:10 -07:00
Tom Caputi
be9a5c355c Add support for decryption faults in zinject
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.

This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7474
2018-05-02 15:36:20 -07:00
Brian Behlendorf
9464b9591e
RHEL 7.5 compat: FMODE_KABI_ITERATE
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.

Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:

* Requires that callers intending to use this extended interface
  set the FMODE_KABI_ITERATE flag on the file structure when
  opening the directory.
* Adds the fops.iterate() method to the end of the structure,
  without removing fops.readdir().

This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected.  Instead fallback to
the fops.readdir() interface which will be available.

Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7460 
Closes #7463
2018-05-02 15:01:24 -07:00
Matthew Ahrens
964c2d69a9 OpenZFS 9236 - nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5

OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668
Closes #7467
2018-04-30 10:19:48 -07:00
Chunwei Chen
599b864813 Fix ENOSPC in "Handle zap_add() failures in ..."
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7401 
Closes #7421
2018-04-18 14:19:50 -07:00
Matthew Ahrens
0c03d21ac9 assertion in arc_release() during encrypted receive
In the existing code, when doing a raw (encrypted) zfs receive, 
we call arc_convert_to_raw() from open context. This creates a 
race condition between arc_release()/arc_change_state() and 
writing out the block from syncing context (arc_write_ready/done()).

This change makes it so that when we are doing a raw (encrypted) 
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode 
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() 
from syncing context when writing out the block of dnodes.

Additionally, we can eliminate dr_raw and associated setters, and 
instead know that dnode blocks are always raw when doing a zfs 
receive (see the new field os_raw_receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7424 
Closes #7429
2018-04-17 11:06:54 -07:00
Serapheim Dimitropoulos
9d5b524597 OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:

[1] Thread A is about to finish condesing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertions fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2018-04-14 12:23:53 -07:00
Matthew Ahrens
9e052db462 OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-04-14 12:21:39 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Tom Caputi
edc1e713c2 Fix race in dnode_check_slots_free()
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147 
Closes #7388
2018-04-10 11:15:05 -07:00
Tony Hutter
4f301661df Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Cloes #7416
2018-04-09 14:24:46 -07:00
Matthew Ahrens
5c27ec1088 Fixes for SNPRINTF_BLKPTR with encrypted BP's
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390
2018-04-06 13:30:26 -07:00
Tom Caputi
a2c2ed1bd4 Decryption error handling improvements
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372
2018-03-31 11:12:51 -07:00
LOLi
77d8a0f1a4 Fix hung z_zvol tasks during 'zfs receive'
During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue.  This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3>      Tainted: P           O  3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                       act  nthr  spwn  maxt   pri  mina
spl_system_taskq/0            1     2     0    64   100     1
	active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
	wait: 5541
spl_delay_taskq/0             0     1     0     4   100     1
	delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                      1     1     0     1   120     1
	active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
	pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330 
Closes #6890 
Closes #7343
2018-03-30 12:10:01 -07:00
Andriy Gapon
5e00213e43 OpenZFS 9164 - assert: newds == os->os_dsl_dataset
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
  to successfully run in under 5 minutes.

OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a
Closes #6112
Closes #7336
2018-03-30 12:00:40 -07:00
Matthew Ahrens
2fd92c3d6c enable zfs_dbgmsg() by default, without dprintf()
zfs_dbgmsg() should record a message by default.  As a general
principal, these messages shouldn't be too verbose.  Furthermore, the
amount of memory used is limited to 4MB (by default).

dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags.  This flag is not set by default
(even on debug builds).  These messages are extremely verbose, and
sometimes nontrivial to compute.

SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags.  This flag is not set by default (even on debug builds).

This brings our behavior in line with illumos.  Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7314
2018-03-21 15:37:32 -07:00
Tom Caputi
089fbf313c Add comments for portable dnode / objset flags
This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the repective flag masks. This patch includes no
functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7313
2018-03-20 11:55:21 -07:00
Olaf Faaland
cec3a0a1bb Report pool suspended due to MMP
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296
2018-03-15 10:56:55 -07:00
Chunwei Chen
9470cbd4f9 Fix race in trace point in zrl_add_impl
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
				        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291
2018-03-12 11:27:02 -07:00
Olaf Faaland
d2160d0538 Record skipped MMP writes in multihost_history
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:15:15 -08:00
Tony Hutter
80d52c3919 Change checksum & IO delay ratelimit values
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252
2018-03-04 17:34:51 -08:00
Tom Caputi
095495e008 Raw DRR_OBJECT records must write raw data
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215 
Closes #7236
2018-02-27 09:04:05 -08:00
Tim Chase
8b5814393f Incorrect maximum DVA value in DDE_GET_NDVAS()
The conditional was reversed which caused garbage values to be used when
calculating dds_ref_dsize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7234
2018-02-26 14:20:12 -08:00
Olaf Faaland
7088545d01 Report duration and error in mmp_history entries
After an MMP write completes, update the relevant mmp_history entry
with the time between submission and completion, and the error
status of the write.

[faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
10610    1166    1518985090 0      144826     633048078    4882...
10611    1166    1518985090 0      164527     666187671    1151...

Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and
error = zio->io_error.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7190
2018-02-22 15:34:34 -08:00
Tom Caputi
b0918402dc Raw receive should change key atomically
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the later of
which is moved to dmu_recv_end() so that they key change can be
done atomically.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7200
2018-02-21 12:31:03 -08:00
Tom Caputi
b1d217338a Raw receives must compress metadnode blocks
Currently, the DMU relies on ZIO layer compression to free LO
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.

This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7198
2018-02-21 12:28:52 -08:00
Tom Caputi
163a8c28dd ZIL claiming should not start user accounting
Currently, ZIL claiming dirties objsets which causes
dsl_pool_sync() to attempt to perform user accounting on
them. This causes problems for encrypted datasets that were
raw received before the system went offline since they
cannot perform user accounting until they have their keys
loaded. This triggers an ASSERT in zio_encrypt(). Since
encryption was added, the code now depends on the fact that
data should only be written when objsets are owned. This
patch adds a check in dmu_objset_do_userquota_updates()
to ensure that useraccounting is only done when the objsets
are actually owned for write. As part of this work, the
zfsvfs and zvol code was updated so that it no longer lies
about owning objsets readonly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6916 
Closes #7163
2018-02-20 16:27:31 -08:00
Don Brady
cbce581353 Fix coverity defects: zfs channel programs
CID 173243, 173245:  Memory - corruptions  (OVERRUN)
 Added size argument to lcompat_sprintf() to avoid use of INT_MAX

CID 173244:  Integer handling issues  (OVERFLOW_BEFORE_WIDEN)
 Added cast to uint64_t to avoid a 32 bit overflow warning

CID 173242:  Integer handling issues  (CONSTANT_EXPRESSION_RESULT)
 Conditionally removed unused luai_numisnan() floating point check

CID 173241:  Resource leaks  (RESOURCE_LEAK)
 Added missing close(fd) on error path

CID 173240:    (UNINIT)
Fixed uninitialized variable in get_special_prop()

CID 147560:  Null pointer dereferences  (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()

CID 28475:  Memory - illegal accesses  (OVERRUN)
Fixed lcompat_sprintf() to use a size paramater

CID 28418, 28422:  Error handling issues  (CHECKED_RETURN)
Added function result cast to (void) to avoid warning

CID 23935, 28411, 28412:  Memory - corruptions  (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7181
2018-02-20 11:19:42 -08:00
George Wilson
ddc751d56b OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio
PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c
Issue #5918
Closes #7168
2018-02-14 15:30:09 -08:00
Nasf-Fan
9c5167d19f Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.

Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').

By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.

Support the following commands and functionalities:

zfs set projectquota@project
zfs set projectobjquota@project

zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project

zfs projectspace

zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused

zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused

chattr +/-P
chattr -p project_id
lsattr -p

This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>

For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by  Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
2018-02-13 14:54:54 -08:00
sanjeevbagewadi
cc63068e95 Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #7011 
Closes #7054
2018-02-09 10:15:53 -08:00
Brian Behlendorf
18f57327e0 Linux 4.16 compat: inode_set_iversion()
A new interface was added to manipulate the version field of an
inode.  Add a inode_set_iversion() wrapper for older kernels and
use the new interface when available.

The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos
5b72a38d68 OpenZFS 8677 - Open-Context Channel Programs
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.

=== What is implemented?

This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
  to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
  the context.
- Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

=== How do we handle zfs.sync functions in open context?

When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...

arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg

if we run it in open context, we will get back the following error:

Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
            [C]: in function 'destroy'
            [string "channel program"]:3: in main chunk

=== What about testing?

We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.

OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15
Closes #6558
2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos
8d103d8856 OpenZFS 8604 - Simplify snapshots unmounting code
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).

The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to amount
it.

With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have too. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.

A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/

OpenZFS-issue: https://www.illumos.org/issues/8604
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb
2018-02-08 15:29:44 -08:00
Chris Williamson
234c91c508 OpenZFS 8600 - ZFS channel programs - snapshot
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.

OpenZFS-issue: https://www.illumos.org/issues/8600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b
2018-02-08 15:29:24 -08:00
Brad Lewis
af07368986 OpenZFS 8592 - ZFS channel programs - rollback
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to perform a rollback.

OpenZFS-issue: https://www.illumos.org/issues/8592
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6
2018-02-08 15:29:14 -08:00
Chris Williamson
d99a015343 OpenZFS 7431 - ZFS Channel Programs
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533

Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
  Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
2018-02-08 15:28:18 -08:00
Tom Caputi
1b66810bad Change os->os_next_write_raw to work per txg
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.

This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
2018-02-02 11:44:53 -08:00
Tom Caputi
047116ac76 Raw sends must be able to decrease nlevels
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6821
Closes #6864
2018-02-02 11:43:11 -08:00
Tom Caputi
ae76f45cda Encryption Stability and On-Disk Format Fixes
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.

Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.

This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.

This patch also contains minor bug fixes and cleanups.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6845
Closes #6864
Closes #7052
2018-02-02 11:37:16 -08:00
Dr. András Korn
4c46b99d24 tx_waited -> tx_dirty_delayed in trace_dmu.h
This change was missed in 0735ecb334.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: András Korn <korn-github.com@elan.rulez.org>
Closes #7096
2018-01-31 16:13:26 -08:00
Prakash Surya
0735ecb334 OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue
PROBLEM
=======

When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.

If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.

If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:

    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);

In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.

SOLUTION
========

This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.

Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issued`, leading the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.

With this change, `zil_lwb_write_issued` can acheive the semtantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.

Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`

OpenZFS-issue: https://www.illumos.org/issues/8997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9
Closes #7084
2018-01-26 20:19:46 -08:00
Brian Behlendorf
8fb1ede146 Extend deadman logic
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems.  The proposed changes include:

* Added a new `zfs_deadman_failmode` module option which is
  used to dynamically control the behavior of the deadman.  It's
  loosely modeled after, but independant from, the pool failmode
  property.  It can be set to wait, continue, or panic.

    * wait     - Wait for the "hung" I/O (default)
    * continue - Attempt to recover from a "hung" I/O
    * panic    - Panic the system

* Added a new `zfs_deadman_ziotime_ms` module option which is
  analogous to `zfs_deadman_synctime_ms` except instead of
  applying to a pool TXG sync it applies to zio_wait().  A
  default value of 300s is used to define a "hung" zio.

* The ztest deadman thread has been re-enabled by default,
  aligned with the upstream OpenZFS code, and then extended
  to terminate the process when it takes significantly longer
  to complete than expected.

* The -G option was added to ztest to print the internal debug
  log when a fatal error is encountered.  This same option was
  previously added to zdb in commit fa603f82.  Update zloop.sh
  to unconditionally pass -G to obtain additional debugging.

* The FM_EREPORT_ZFS_DELAY event which was previously posted
  when the deadman detect a "hung" pool has been replaced by
  a new dedicated FM_EREPORT_ZFS_DEADMAN event.

* The proposed recovery logic attempts to restart a "hung"
  zio by calling zio_interrupt() on any outstanding leaf zios.
  We may want to further restrict this to zios in either the
  ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
  Calling zio_interrupt() is expected to only be useful for
  cases when an IO has been submitted to the physical device
  but for some reasonable the completion callback hasn't been
  called by the lower layers.  This shouldn't be possible but
  has been observed and may be caused by kernel/driver bugs.

* The 'zfs_deadman_synctime_ms' default value was reduced from
  1000s to 600s.

* Depending on how ztest fails there may be no cache file to
  move.  This should not be considered fatal, collect the logs
  which are available and carry on.

* Add deadman test cases for spa_deadman() and zio_wait().

* Increase default zfs_deadman_checktime_ms to 60s.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:40:38 -08:00
Brian Behlendorf
31864e3d8c
OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL
usr/src/uts/common/sys/fs/zfs.h
	Change ZPROP_INVAL and ZPROP_CONT from macros to enum values.  Clang
	and GCC both prefer to use unsigned ints to store enums.  That was
	causing tautological comparison warnings (and likely eliminating
	error handling code at compile time) whenever a zfs_prop_t or
	zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT.  Making the
	error flags be explicity enum values forces the enum types to be
	signed.

	ZPROP_INVAL was also compared against two different enum types.  I
	had to change its name to ZPOOL_PROP_INVAL whenever its compared to
	a zpool_prop_t.  There are still some places where ZPROP_INVAL or
	ZPROP_CONT is compared to a plain int, in code that doesn't know
	whether the int is storing a zfs_prop_t or a zpool_prop_t.

usr/src/uts/common/fs/zfs/spa.c
	s/ZPROP_INVAL/ZPOOL_PROP_INVAL/

Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74
Closes #7061
2018-01-19 09:22:37 -08:00
Sean Eric Fagan
43cb30b3ce OpenZFS 8959 - Add notifications when a scrub is paused or resumed
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.

OpenZFS-issue: https://www.illumos.org/issues/8959
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea
Closes #7049
2018-01-17 10:31:00 -08:00
Brian Behlendorf
e1a0850c35
Force ztest to always use /dev/urandom
For ztest, which is solely for testing, using a pseudo random
is entirely reasonable.  Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7017 
Closes #7036
2018-01-12 09:36:26 -08:00
Nathaniel Wesley Filardo
cba6fc61a2 Revert raidz_map and _col structure types
As part of the refactoring of ab9f4b0b82,
several uint64_t-s and uint8_t-s were changed to other types.  This
caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM
machine.  In absense of any strong motivation for the type changes, this
simply puts them back, modulo the changes accumulated for ABD.

Compile-tested on amd64 and run-tested on armhf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes #6981 
Closes #7023
2018-01-09 14:46:52 -08:00
Brian Behlendorf
0873bb6337
Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6171 
Closes #6852 
Closes #6989
2018-01-08 09:52:36 -08:00
Prakash Surya
2fe61a7ecc OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-28 10:18:04 -08:00
lidongyang
823d48bfb1 Call commit callbacks from the tail of the list
Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.

This patch corrects this, it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes #6986
2017-12-22 10:19:51 -08:00
Tom Caputi
a8b2e30685 Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.

This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921 
Closes #6926
2017-12-21 09:13:06 -08:00
Brian Behlendorf
c28a67733c
Suppress incorrect objtool warnings
Suppress incorrect warnings from versions of objtool which are not
aware of x86 EVEX prefix instructions used for AVX512.

  module/zfs/vdev_raidz_math_avx512bw.o: warning:
  objtool: <func+offset>: can't find jump dest instruction at .text

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6928
2017-12-07 10:28:50 -08:00