Commit Graph

2359 Commits

Author SHA1 Message Date
Allan Jude
eabf270b2c
Remove duplicate include of sys/zfeature.h in dmu_objset.c
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10636
2020-07-31 09:04:45 -07:00
Matthew Ahrens
948423a3d1
zfs promote does not delete livelist of origin
When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10652
2020-07-31 08:59:00 -07:00
Matthew Ahrens
3a92552f75
Fix error handling of vdev_top_zap
In `vdev_load()`, we look up several entries in the `vdev_top_zap`
object.  In most cases, if we encounter an i/o error, it will be
returned to the caller.  However, when handling
`VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue
on, which in theory could cause us to not realize that a vdev should be
used only for `special` allocations.

In practice, if we encountered an i/o error while looking for
`VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an
i/o error while looking for other entries in the same object, and thus
the zpool open/import would fail.  Therefore the impact of this problem
is negligible.

This commit adds error handling for i/o errors while accessing the
`vdev_top_zap`, so that we aren't relying on unrelated code to fail for
us.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10637
2020-07-29 17:04:34 -07:00
Matthew Macy
27d96d2254
Rename refcount.h to zfs_refcount.h
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10620
2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos
843e9ca2e1
Introduce names for ZTHRs
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10630
2020-07-29 09:43:33 -07:00
Matthew Macy
5678d3f593
Prefix zfs internal endian checks with _ZFS
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10621
2020-07-28 13:02:49 -07:00
Matthew Macy
e64cc4954c
Refactor ccompile.h to not include system headers
This is a step toward being able to vendor the OpenZFS code in FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10625
2020-07-25 20:09:50 -07:00
Matthew Macy
6d8da84106
Make use of ZFS_DEBUG consistent within kmod sources
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10623
2020-07-25 20:07:44 -07:00
Matthew Macy
f5b189f937
FreeBSD: Fixes required to build ZFS on PowerPC
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10622
2020-07-25 11:00:23 -07:00
Brian Atkinson
6fba7bfd0e
Add gang ABD child to parent gang ABD
By design a gang ABD can not have another gang ABD as a child. This is
to make sure the logical offset in a gang ABD is consistent with the
individual ABDS it contains as children. If a gang ABD is added as a
child of a gang ABD we will add the individual children of the gang ABD
to the parent gang ABD. This allows for a consistent view of offsets
within the parent gang ABD.

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10430
2020-07-24 21:09:20 -07:00
Ryan Moeller
8348fac30c
Limit dbuf cache sizes based only on ARC target size by default
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.

Recalculate the metadata cache size on demand so it can adapt, too.

Update descriptions in zfs-module-parameters(5).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10563 
Closes #10610
2020-07-24 20:38:48 -07:00
Matthew Ahrens
5dd92909c6
Adjust ARC terminology
The process of evicting data from the ARC is referred to as
`arc_adjust`.

This commit changes the term to `arc_evict`, which is more specific.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10592
2020-07-22 09:51:47 -07:00
Matthew Ahrens
026e529cb3
Remove skc_reclaim, hdr_recl, kmem_cache shrinker
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure.  This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function.  This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.

The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`.  When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op.  When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it.  Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.

The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer.  Therefore, additionally waking it from the
shrinker calbacks has little benefit.

The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second.  Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times.  Therefore, this mechanism consumes
significant amounts of CPU time.

The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
 1. Return free objects from the magazine layer to the slab layer
 2. Return entirely-free slabs to the page layer (i.e. free pages)

These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches.  The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.

These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine.  So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.

We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected.  Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.

This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10576
2020-07-19 09:58:30 -07:00
Matthew Ahrens
6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
Alexander Motin
1743c737f5
Fix LOR between dp_config_rwlock and spa_props_lock
Our QE team during automated API testing hit deadlock in ZFS, caused
by lock order reversal.  From one side dsl_sync_task_sync() locks
dp_config_rwlock as writer and calls spa_sync_props(), which waits
for spa_props_lock.  From another spa_prop_get() locks spa_props_lock
and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock
as reader.

This patch makes spa_prop_get() lock dp_config_rwlock before
spa_props_lock, making the order consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #10553
2020-07-14 12:21:57 -07:00
Brian Atkinson
e4d3d77684
Fixing gang ABD child removal race condition
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10511
2020-07-14 11:04:35 -07:00
Matthew Ahrens
e59a377a8f
filesystem_limit/snapshot_limit is incorrectly enforced against root
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit."  Two types of users are allowed to change
the limit:

1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`.  This works properly.

2. A user with elevated system privileges (e.g. root).  This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.

The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process.  This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.

This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged.  Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.

This problem only impacts Linux.  Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8226
Closes #10545
2020-07-11 17:18:02 -07:00
George Amanakis
2054f35e56
Fix a persistent L2ARC bug in l2arc_write_done()
In case l2arc_write_done() handles a zio that was not successful check
that the list of log block pointers is not empty when restoring them
in the device header. Otherwise zero them out. In any case perform the
actual write updating the device header after the zio of
l2arc_write_buffers() completes as l2arc_write_done() may have touched
the memory holding the log block pointers in the device header.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10540 
Closes #10543
2020-07-10 14:10:03 -07:00
Mark Johnston
6e00561712 Add a "try" operation for range locks
zfs_rangelock_tryenter() bails immediately instead of waiting for the
lock to become available.  This will be used to resolve a deadlock in
the FreeBSD page-in code.  No functional change intended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #10519
2020-07-06 11:53:31 -07:00
Brian Behlendorf
9a49d3f3d3
Add device rebuild feature
The device_rebuild feature enables sequential reconstruction when
resilvering.  Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices.  However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.

The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.

    zpool attach -s <pool> <existing vdev> <new vdev>
    zpool replace -s <pool> <old vdev> <new vdev>

The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.

The `zpool wait -t resilver` command was extended to wait on
sequential resilvers.  From this perspective they are no different
than healing resilvers.

Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.

As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory.  Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.

Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10349
2020-07-03 11:05:50 -07:00
Matthew Macy
7ddb753d17
freebsd: changes necessary to coexist with dtrace in tree
Fix header conflicts when building zfs with openzfs as a vendor import.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10497
2020-07-01 09:10:08 -07:00
Matthew Ahrens
3c42c9ed84
Clean up OS-specific ARC and kmem code
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems.  In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems.  For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.

Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.

These changes do not impact system behavior, they are purely code
cleanup.  The changes are:

Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.

In `arc_available_memory()`:
 * `desfree` is unused, remove it
 * rename `freemem` to avoid conflict with pre-existing `#define`
 * remove checks related to arenas
 * use units of bytes, rather than converting from bytes to pages and
   then back to bytes

`SPL_KMEM_CACHE_REAP` is unused, remove it.

`skc_reap` is unused, remove it.

The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.

`vmem_size()` and associated type and macros are unused, remove them.

In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10499
2020-06-29 09:01:07 -07:00
Matthew Ahrens
67c0f0dedc
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.

Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 10:42:27 -07:00
Ryan Moeller
9192f27c1d
Add zfs_multihost_interval tunable handler for FreeBSD
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.

Add the handler so the tunable can be declared in common code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10490
2020-06-23 13:32:42 -07:00
Arvind Sankar
0ce2de637b Add prototypes
Add prototypes/move prototypes to header files.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:32 -07:00
Arvind Sankar
60356b1a21 Add include files for prototypes
Include the header with prototypes in the file that provides definitions
as well, to catch any mismatch between prototype and definition.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:25 -07:00
Arvind Sankar
c3fe42aabd Remove dead code
Delete unused functions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:18 -07:00
Arvind Sankar
65c7cc49bf Mark functions as static
Mark functions used only in the same translation unit as static. This
only includes functions that do not have a prototype in a header file
either.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:20:38 -07:00
Matthew Macy
8056a75672
Disambiguate condvar API contract
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.

Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10471
2020-06-18 10:17:50 -07:00
Matthew Macy
7564073ed6
Add abd_cache_reap_now for abd_chunk_cache users
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10474
2020-06-17 21:44:13 -07:00
Jorgen Lundman
4458157bee
zfs_ioctl: saved_poolname can be truncated
As it uses kmem_strdup() and kmem_strfree() which both rely on
strlen() being the same, but saved_poolname can be truncated causing:

SPL: kernel memory allocator:
buffer freed to wrong cache
SPL: buffer was allocated from kmem_alloc_16,
SPL: caller attempting free to kmem_alloc_8.
SPL: buffer=0xffffff90acc66a38  bufctl=0x0  cache: kmem_alloc_8

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10469
2020-06-17 14:30:03 -07:00
Alexander Motin
17ca30185a
Set initial arc_c to arc_c_min instead of arc_c_max
For at least 15 years since OpenSolaris arc_c was set by default to
arc_c_max, later decreased under memory pressure.  I've noticed that
if arc_c was set high enough to cause memory pressure as considered
by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes
no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return.  All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point when
OS low memory handler activates and reduces arc_c, requesting fast
reclamation of just allocated memory.

The problem seems to be more serious on FreeBSD and I guess Linux,
since neither of them implement/use asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time.  On older FreeBSD 11 not
supporting multiple memory domains system with lots of RAM can get
completely unresponsive for minutes due to heavy lock congestion
between ARC reclamation and page daemon kmem reclamation threads.
With this change to more conservative arc_c value ARC stops growing
just it time and does not need later reclamation.

Also while there, since now growing arc_c is a more often situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock congestion.  It is also getting in sync with code in
arc_get_data_impl().

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437
2020-06-17 14:27:04 -07:00
Jorgen Lundman
883a40fff4
Add convenience wrappers for common uio usage
The macOS uio struct is opaque and the API must be used, this
makes the smallest changes to the code for all platforms.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10412
2020-06-14 10:09:55 -07:00
Jorgen Lundman
4f73576ea1
Upstream: zil_commit_waiter() can stall forever
On macOS clock_t is unsigned, so when cv_timedwait_hires() returns -1
we loop forever. The conditional was tweaked to ignore signedness.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10445
2020-06-14 10:08:21 -07:00
Arvind Sankar
71504277ae Cleanup linux module kbuild files
The linux module can be built either as an external module, or compiled
into the kernel, using copy-builtin. The source and build directories
are slightly different between the two cases, and currently, compiling
into the kernel still refers to some files from the configured ZFS
source tree, instead of the copies inside the kernel source tree. There
is also duplication between copy-builtin, which creates a Kbuild file to
build ZFS inside the kernel tree, and the top-level module/Makefile.in.

Fix this by moving the list of modules and the CFLAGS settings into a
new module/Kbuild.in, which will be used by the kernel kbuild
infrastructure, and using KBUILD_EXTMOD to distinguish the two cases
within the Makefiles, in order to choose appropriate include
directories etc.

Module CFLAGS setting is simplified by using subdir-ccflags-y (available
since 2.6.30) to set them in the top-level Kbuild instead of each
individual module. The disabling of -Wunused-but-set-variable is removed
from the lua and zfs modules. The variable that the Makefile uses is
actually not defined, so this has no effect; and the warning has long
been disabled by the kernel Makefile itself.

The target_cpu definition in module/{zfs,zcommon} is removed as it was
replaced by use of CONFIG_SPARC64 in
  commit 70835c5b75 ("Unify target_cpu handling")

os/linux/{spl,zfs} are removed from obj-m, as they are not modules in
themselves, but are included by the Makefile in the spl and zfs module
directories. The vestigial Makefiles in os and os/linux are removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10379
Closes #10421
2020-06-10 09:24:15 -07:00
Andrea Gelmini
dd4bc569b9
Fix typos
Correct various typos in the comments and tests.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #10423
2020-06-09 21:24:09 -07:00
Matthew Ahrens
7bcb7f0840
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:

By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks.  By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature.  A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out.  The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics
of zfs send/receive:

1. "-L to no-L" incrementals are rejected.  If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:

    incremental send stream requires -L (--large-block), to match
    previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals.  This flag is currently not set on any send streams.  In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.

In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks.  The zio pipeline will recompress
each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224 
Closes #10383
2020-06-09 10:41:01 -07:00
George Amanakis
b7654bd794
Trim L2ARC
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789 
Closes #10224
2020-06-09 10:15:08 -07:00
Pawel Jakub Dawidek
529246df96
Restore support for in-kernel ZFS ioctls
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.

Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #10417
2020-06-08 13:57:22 -07:00
Jorgen Lundman
c9e319faae
Replace sprintf()->snprintf() and strcpy()->strlcpy()
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used.  If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().

The biggest change is adding a size parameter to zfs_id_to_fuidstr().

The various *_impl_get() functions are only used on linux and have
not yet been updated.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10400
2020-06-07 11:42:12 -07:00
Paul Dagnelie
99b281f1ae
Fix double mutex_init bug in send code
It was possible to cause a kernel panic in the send code by 
initializing an already-initialized mutex, if a record was created 
with type DATA, destroyed with a different type (bypassing the 
mutex_destroy call) and then re-allocated as a DATA record again.

We tweak the logic to not change the type of a record once it has 
been created, avoiding the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10374
2020-06-03 19:53:21 -07:00
Ryan Moeller
a9dcfac51c
Periodically update ARC kstats
FreeBSD needs arc_adjust_zthr to run periodically for kstats to be
updated.  A comment in the code suggests this may have been the
original intent in illumos as well:

c946d5a913/module/zfs/arc.c (L4697-L4700)

Create the thread with a 1 second timer.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10371
2020-06-03 09:52:38 -07:00
Jorgen Lundman
70a5fc0530
Memory leak in dsl_destroy_snapshots_nvl error case
The dsl_destroy_snapshots_nvl() function has an early error out, 
and temporary nvlists were not freed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10366
2020-05-26 16:13:41 -07:00
Brian Atkinson
fb822260b1
Gang ABD Type
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.

This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10069
2020-05-20 18:06:09 -07:00
DeHackEd
57434abae6
Use boot_ncpus in place of max_ncpus in taskq_create
Due to hotplug support or BIOS bugs sometimes max_ncpus can be
an absurdly high value. I have a system with 32 cores/threads
but reports max_ncpus == 440. This many threads potentially
cripples the system during arc_prune floods for example.

boot_ncpus is the number of working CPUs when called so use
that instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #10282
2020-05-20 10:07:21 -07:00
Matthew Ahrens
1b9cd1a9d9
Fix error handling in receive_writer_thread()
If `receive_writer_thread()` gets an error from `receive_process_record()`,
it should be saved in `rwa->err` so that we will stop processing records,
and the main thread will notice that the receive has failed.

When an error is first encountered, this happens correctly.  However, if
there are more records to dequeue, the next time through the loop we
will reset `rwa->err` to zero, allowing us to try to process the
following record (2 after the failed record).  Depending on what types
of records remain, we may incorrectly complete the receive
"successfully", but without actually having processed all the records.

The fix is to only set `rwa->err` if we got a *non-zero* error.

This bug was introduced by #10099 "Improve zfs receive performance by
batching writes".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10320
2020-05-14 20:48:29 -07:00
Brian Behlendorf
2ade659eb4
Fix abd_enter/exit_critical wrappers
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections.  On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving.  By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.

Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10332
2020-05-14 20:45:16 -07:00
Jorgen Lundman
eeb8fae9c7
Upstream: add missing thread_exit()
Undo FreeBSD wrapper for thread_create() added to call thread_exit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10314
2020-05-14 15:58:09 -07:00
Matthew Ahrens
8b240f14f9
remove unneeded member drc_err of dmu_recv_cookie_t
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10319
2020-05-14 12:10:29 -07:00
John Poduska
41035a0496
Resilver restarts unnecessarily when it encounters errors
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.

In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.

A test has been added to verify that resilver does not restart when
relevant DTL's are available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #10291
2020-05-13 10:54:27 -07:00
Brian Atkinson
fc551d7efb
Combine OS-independent ABD Code into Common Source File
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.

The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10293
2020-05-10 12:23:52 -07:00
George Amanakis
657fd33bcf
Improvements on persistent L2ARC
Functional changes:

We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.

1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
   reflect the total aligned size of log blocks on the device. This is
   also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
   which reflect the total number of L2ARC log blocks present on cache
   devices.  It is also reflected in the header of the cache device as
   "dh_lb_count".

In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.

Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.

If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().

If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.

In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.

Non-functional changes:

Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.

Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.

Fix inaccurate comments for example in l2arc_log_blk_restore().

Add asserts at the end in l2arc_evict() and l2arc_write_buffers().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10228
2020-05-07 16:34:03 -07:00
Paul Dagnelie
108a454a46
Add support for boot environment data to be stored in the label
Modern bootloaders leverage data stored in the root filesystem to 
enable some of their powerful features. GRUB specifically has a grubenv 
file which can store large amounts of configuration data that can be 
read and written at boot time and during normal operation. This allows 
sysadmins to configure useful features like automated failover after 
failed boot attempts. Unfortunately, due to the Copy-on-Write nature 
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store 
data that allows the bootloader to make changes to the data.

This work is very similar to work that was done on Illumos to enable 
similar functionality in the FreeBSD bootloader. This patch is different 
in that the data being stored is a raw grubenv file; this file can store 
arbitrary variables and values, and the scripting provided by grub is 
powerful enough that special structures are not required to implement 
advanced behavior.

We repurpose the second padding area in each label to store the grubenv 
file, protected by an embedded checksum. We add two ioctls to get and 
set this data, and libzfs_core and libzfs functions to access them more 
easily. There are no direct command line interfaces to these functions; 
these will be added directly to the bootloader utilities.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10009
2020-05-07 09:36:33 -07:00
George Amanakis
1b664952ae
Enable splitting mirrors with indirect vdevs
When a top-level vdev is removed from a pool it is converted to an
indirect vdev. Until now splitting such mirrored pools was not possible
with zpool split. This patch enables handling of indirect vdevs and
splitting of those pools with zpool split.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10283
2020-05-06 10:32:28 -07:00
George Amanakis
fa25460538
Add missing zfs_refcount_destroy() in key_mapping_rele()
Otherwise when running with reference_tracking_enable=TRUE mounting
and unmounting an encrypted dataset panics with:

Call Trace:
 dump_stack+0x66/0x90
 slab_err+0xcd/0xf2
 ? __kmalloc+0x174/0x260
 ? __kmem_cache_shutdown+0x158/0x240
 __kmem_cache_shutdown.cold+0x1d/0x115
 shutdown_cache+0x11/0x140
 kmem_cache_destroy+0x210/0x230
 spl_kmem_cache_destroy+0x122/0x3e0 [spl]
 zfs_refcount_fini+0x11/0x20 [zfs]
 spa_fini+0x4b/0x120 [zfs]
 zfs_kmod_fini+0x6b/0xa0 [zfs]
 _fini+0xa/0x68c [zfs]
 __x64_sys_delete_module+0x19c/0x2b0
 do_syscall_64+0x5b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10246
2020-04-28 09:53:45 -07:00
Tom Caputi
aa646323db
Fix missing ivset guid with resumed raw base recv
This patch corrects a bug introduced in 61152d1069. When
resuming a raw base receive, the dmu_recv code always sets
drc->drc_fromsnapobj to the object ID of the previous
snapshot. For incrementals, this is correct, but for base
sends, this should be left at 0. The presence of this ID
eventually allows a check to run which determines whether
or not the incoming stream and the previous snapshot have
matching IVset guids. This check fails becuase it is not
meant to run when there is no previous snapshot. When it
does fail, the user receives an error stating that the
incoming stream has the problem outlined in errata 4.

This patch corrects this issue by simply ensuring
drc->drc_fromsnapobj is left as 0 for base receives.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #10234 
Closes #10239
2020-04-24 19:00:32 -07:00
Matthew Ahrens
196bee4cfd
Remove deduplicated send/receive code
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
2020-04-23 10:06:57 -07:00
Matthew Ahrens
32d805c3e2
Use a struct to organize metaslab-group-allocator fields
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators".  Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators.  In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.

This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t.  The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10213
2020-04-22 10:26:56 -07:00
Matthew Ahrens
1f043c8be1
Fix zfs send progress reporting
The progress of a send is supposed to be reported by `zfs send -v`, but
it is not.  This works by creating a new user thread (with
pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check how
much progress has been made.  This IOCTL finds the specified send (since
there may be multiple concurrent sends in the system).  The IOCTL also
checks that the specified send was started by the current process.

On Linux, different threads of the same process are represented as
different `struct task_struct`s (and, confusingly, have different
PID's).  To check if if two threads are in the same process, we need to
check if they have the same `struct task_struct:group_leader`.

We used to to this correctly, but it was inadvertently changed by
30af21b025 (Redacted Send) to simply check if the current
`struct task_struct` is the one that started the send.

This commit changes the code back to checking if the send was started by
a `struct task_struct` with the same `group_leader` as the calling
thread.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10215 
Closes #10216
2020-04-20 10:12:48 -07:00
George Amanakis
9249f1272e
Persistent L2ARC minor fixes
Minor fixes on persistent L2ARC improving code readability and fixing 
a typo in zdb.c when byte-swapping a log block. It also improves the 
pesist_l2arc_007_pos.ksh test by giving it more time to retrieve log 
blocks on the cache device.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10210
2020-04-17 09:27:40 -07:00
Ryan Moeller
a7929f3137
Update FreeBSD tunables
Remove some obsolete legacy compat, rename some misnamed, and add some
missing tunables for FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10203
2020-04-15 11:14:47 -07:00
Brian Behlendorf
791e480c6a
Disable user space reference tracking
The memory and cpu cost of reference count tracking with the current
implementation is significant.  For this reason it has always been
disabled by default for the kmods.  Apply this same default to user
space so ztest doesn't always incur this performance penalty.

Our intention is to re-enable this by default for ztest once the code
has been optimized.  Since we expect to at some point provide a FUSE
implementation we wouldn't want this enabled by default for libzpool.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10189
2020-04-13 10:51:44 -07:00
George Amanakis
77f6826b83
Persistent L2ARC
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #925 
Closes #1823 
Closes #2672 
Closes #3744 
Closes #9582
2020-04-10 10:33:35 -07:00
Ryan Moeller
36a6e2335c
Don't ignore zfs_arc_max below allmem/32
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.

Add warning messages when tunables are being ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10157
Closes #10158
2020-04-09 15:39:48 -07:00
Matthew Macy
8b27e08ed8
Add separate field for indicating that spa is in middle of split
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10178
2020-04-09 09:59:31 -07:00
Matthew Macy
01c4f2bf29
Use vn_io_fault_uiomove on FreeBSD to avoid potential deadlock
Added to prevent a possible deadlock, the following comments from
FreeBSD explain the issue.  The comment describing vn_io_fault_uiomove:

/*
 * Helper function to perform the requested uiomove operation using
 * the held pages for io->uio_iov[0].iov_base buffer instead of
 * copyin/copyout.  Access to the pages with uiomove_fromphys()
 * instead of iov_base prevents page faults that could occur due to
 * pmap_collect() invalidating the mapping created by
 * vm_fault_quick_hold_pages(), or pageout daemon, page laundry or
 * object cleanup revoking the write access from page mappings.
 *
 * Filesystems specified MNTK_NO_IOPF shall use vn_io_fault_uiomove()
 * instead of plain uiomove().
 */

This used for vn_io_fault which has the following motivation:

/*
 * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
 * prevent the following deadlock:
 *
 * Assume that the thread A reads from the vnode vp1 into userspace
 * buffer buf1 backed by the pages of vnode vp2.  If a page in buf1 is
 * currently not resident, then system ends up with the call chain
 *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
 *     vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
 * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
 * If, at the same time, thread B reads from vnode vp2 into buffer buf2
 * backed by the pages of vnode vp1, and some page in buf2 is not
 * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
 *
 * To prevent the lock order reversal and deadlock, vn_io_fault() does
 * not allow page faults to happen during VOP_READ() or VOP_WRITE().
 * Instead, it first tries to do the whole range i/o with pagefaults
 * disabled. If all pages in the i/o buffer are resident and mapped,
 * VOP will succeed (ignoring the genuine filesystem errors).
 * Otherwise, we get back EFAULT, and vn_io_fault() falls back to do
 * i/o in chunks, with all pages in the chunk prefaulted and held
 * using vm_fault_quick_hold_pages().
 *
 * Filesystems using this deadlock avoidance scheme should use the
 * array of the held pages from uio, saved in the curthread->td_ma,
 * instead of doing uiomove().  A helper function
 * vn_io_fault_uiomove() converts uiomove request into
 * uiomove_fromphys() over td_ma array.
 *
 * Since vnode locks do not cover the whole i/o anymore, rangelocks
 * make the current i/o request atomic with respect to other i/os and
 * truncations.
 */

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10177
2020-04-08 10:30:27 -07:00
Ryan Moeller
7e3df9db12
Finish refactoring for ZFS_MODULE_PARAM_CALL
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.

To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.

With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10179
2020-04-07 10:06:22 -07:00
Paul Dagnelie
5a42ef04fd
Add 'zfs wait' command
Add a mechanism to wait for delete queue to drain.

When doing redacted send/recv, many workflows involve deleting files 
that contain sensitive data. Because of the way zfs handles file 
deletions, snapshots taken quickly after a rm operation can sometimes 
still contain the file in question, especially if the file is very 
large. This can result in issues for redacted send/recv users who 
expect the deleted files to be redacted in the send streams, and not 
appear in their clones.

This change duplicates much of the zpool wait related logic into a 
zfs wait command, which can be used to wait until the internal
deleteq has been drained.  Additional wait activities may be added 
in the future. 

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9707
2020-04-01 10:02:06 -07:00
George Amanakis
37c22948e5
Reset l2ad_hand and l2ad_first in l2arc_evict
Increasing l2arc_write_size or l2arc_write_boost can result in
l2arc_write_buffers() not having enough space to perform its writes and
panic zio_write_phys().

Instead of resetting l2ad_hand to l2ad_start at the end of
l2arc_write_buffers() and not taking into account a possible
user-mediated increase of l2arc_write_max, we do this in l2arc_evict(),
right after l2arc_write_size() has run. If there is not enough space to
evict (ie we will exceed l2ad_end) we evict to the end of the device,
reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate
l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making
sure in l2arc_write_size() that l2ad_start + size does not exceed
l2ad_end.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10154
2020-03-31 10:46:48 -07:00
Ryan Moeller
9a51738b60
Let default arc_c_max be platform dependent
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator.  Other platforms
do not require the same logic.

Implement an arc_default_max() function to determine a default max ARC
size in platform code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10155
2020-03-27 09:14:46 -07:00
Matthew Ahrens
3f38797338
Compile cityhash code into libzfs
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10152
2020-03-27 09:11:22 -07:00
Paul Dagnelie
7145123b0a
Separate warning for incomplete and corrupt streams
This change adds a separate return code to zfs_ioc_recv that is used 
for incomplete streams, in addition to the existing return code for 
streams that contain corruption.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10122
2020-03-17 10:30:33 -07:00
Matthew Ahrens
7261fc2e81
Improve zfs receive performance by batching writes
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object.  If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant.  For example, in some
workloads the `receieve_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).

To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all.  By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).

This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10099
2020-03-16 11:51:56 -07:00
Matthew Ahrens
0fdd6106bb
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock.  For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it.  This may lead to deadlock, since the lock order is
reversed.

Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock.  However, there are a few places in
the send and receive code paths that do not.  For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.

This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock.  In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t).  Thus we do not need to acquire the dp_config_rwlock in
many new places.

I also cleaned up code in dmu_redact_snap() and send_traverse_thread().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 10:55:02 -07:00
Alexander Motin
fa130e010c
Fix infinite scan on a pool with only special allocations
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.

Addition of special and just in case dedup classes fixes the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10106 
Closes #8694
2020-03-12 10:52:03 -07:00
John Poduska
e6b28efccc
Prevent race condition in dnode_dest (#10101)
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:

	refs = zfs_refcount_remove(&dn->dn_holds, tag);
	mutex_exit(&dn->dn_mtx);

So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.

This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #7814
Closes #10101
2020-03-12 10:25:56 -07:00
Mark Roper
1e9231ada8
Prevent deadlock in arc_read in Linux memory reclaim callback
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

    __schedule at ffffffff81610f2e
    schedule at ffffffff81611558
    schedule_preempt_disabled at ffffffff8161184a
    __mutex_lock at ffffffff816131e8
    arc_buf_destroy at ffffffffa0bf37d7 [zfs]
    dbuf_destroy at ffffffffa0bfa6fe [zfs]
    dbuf_evict_one at ffffffffa0bfaa96 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
    dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
    osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    lu_object_free at ffffffffa06d6a74 [obdclass]
    lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    shrink_slab at ffffffff811ca9d8
    shrink_node at ffffffff811cfd94
    do_try_to_free_pages at ffffffff811cfe63
    try_to_free_pages at ffffffff811d01c4
    __alloc_pages_slowpath at ffffffff811be7f2
    __alloc_pages_nodemask at ffffffff811bf3ed
    new_slab at ffffffff81226304
    ___slab_alloc at ffffffff812272ab
    __slab_alloc at ffffffff8122740c
    kmem_cache_alloc at ffffffff81227578
    spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    arc_read at ffffffffa0bf0924 [zfs]
    dbuf_read at ffffffffa0bf9083 [zfs]
    dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes #9987
2020-03-12 10:24:43 -07:00
Matthew Ahrens
1dc32a67e9
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads.  This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks.  Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.

The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread.  It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`.  This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU.  It also induces later
costs by other threads:

 * Since the data was only prefetched, dmu_send()->dmu_dump_write() will
   need to call arc_read() again to get the data.  This will have to
   look up the arc_hdr in the hash table and copy the data from the
   scatter ABD in the arc_hdr to a linear ABD in arc_buf.  This takes
   27% of one CPU.

 * dmu_dump_write() needs to arc_buf_destroy()  This takes 11% of one
   CPU.

 * arc_adjust() will need to evict this arc_hdr, taking about 50% of one
   CPU.

All of these costs can be avoided by bypassing the ARC if the data is
not already cached.  This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.

The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second.  This change
increases the metric by 50%, from ~100,000 to ~150,000.  When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).

In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache.  In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 10:51:04 -07:00
Brian Behlendorf
f49db9b504
zio: dprintf_bp() if errors > 0 in zfs_blkptr_verify()
Also dprintf_bp() in case BLK_VERIFY_HALT of zfs_blkptr_verify_log()
since dprintf_bp() in zfs_blkptr_verify() will never be executed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Justin Keogh <commits@v6y.net>
Closes #10086
2020-03-04 15:08:41 -08:00
Brian Behlendorf
2288d41968
Add trim support to zpool wait
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #10071
2020-03-04 15:07:11 -08:00
Matthew Ahrens
b3212d2fa6
Improve performance of zio_taskq_member
__zio_execute() calls zio_taskq_member() to determine if we are running
in a zio interrupt taskq, in which case we may need to switch to
processing this zio in a zio issue taskq.  The call to
zio_taskq_member() can become a performance bottleneck when we are
processing a high rate of zio's.

zio_taskq_member() calls taskq_member() on each of the zio interrupt
taskqs, of which there are 21.  This is slow because each call to
taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
slow.

This commit improves the performance of zio_taskq_member() by having it
cache the value of tsd_get(taskq_tsd), reducing the number of those
calls to 1/21th of the current behavior.

In a test case running `zfs send -c >/dev/null` of a filesystem with
small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of
one CPU, and with this change it is reduced to 1.3%.  Overall time to
perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000
blocks/sec).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10070
2020-03-03 10:29:38 -08:00
Ryan Moeller
9bb907bc3f
Make spa_history_zone platform-dependent in kernel
This function should only return "linux" on Linux.

Move the kernel part of the function out of common code.
Fix the tests for FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10079
2020-03-02 09:43:30 -08:00
Matthew Macy
cf118ae8dc
Don't call zrele on passed zp in zfs_xattr_owner_unlinked on FreeBSD
FreeBSD has a somewhat more cumbersome locking and refcounting
protocol for the platform counterpart to znode. We need to not call
zrele on the passed zp, but do need to do so on any intermediate zp.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10075
2020-02-28 14:53:18 -08:00
Matthew Macy
ae9f92f6f3
Re-share zfsdev_getminor and zfs_onexit_fd_hold
By adding a zfs_file_private accessor to the common
interfaces and some extensions to FreeBSD platform
code it is now possible to share the implementations
for the aforementioned functions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10073
2020-02-28 14:50:32 -08:00
Matthew Ahrens
9cdf7b1f6b
Improve zfs destroy performance with zio_t-free zio_free()
When "zfs destroy" is run, it completes quickly, and in the background
we locate the blocks to free and free them.  This background activity
can be observed with `zpool get freeing` and `zpool wait -t free ...`.

This background activity is processed by a single thread (the spa_sync
thread) which calls zio_free() on each of the blocks to free.  With even
modest storage performance, the CPU consumption of zio_free() can be the
performance bottleneck.

Performance of zio_free() can be improved by not actually creating a
zio_t in the common case (non-dedup, non-gang), instead calling
metaslab_free() directly.  This avoids the CPU cost of allocating the
zio_t, and more importantly the cost of adding and later removing this
zio_t from the parent zio's child list.

The result is that performance of background freeing more than doubles,
from 0.6 million blocks per second to 1.3 million blocks per second.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10034
2020-02-28 14:49:44 -08:00
Matthew Macy
13fac09868
Consolidate arc_buf allocation checks
The following check currently occurs in three separate locations
in dbuf.c.  This change consolidates those checks in to the
dbuf_alloc_arcbuf_from_arcbuf() function.

if (arc_is_encrypted(data)) {
...
} else if (compress_type != ZIO_COMPRESS_OFF) {
...
} else {
...
}

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10057
2020-02-27 17:12:44 -08:00
Brian Behlendorf
2c3a83701d Linux 5.6 compat: time_t
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired.  Callers are now required to use the time64_t type.

Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code.  They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.

Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.

Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10052
Closes #10064
2020-02-27 09:31:02 -08:00
Matthew Macy
28caa74b19
Refactor dnode dirty context from dbuf_dirty
* Add dedicated donde_set_dirtyctx routine.
* Add empty dirty record on destroy assertion.
* Make much more extensive use of the SET_ERROR macro.

Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9924
2020-02-26 16:09:17 -08:00
Matthew Macy
c6a6b4d50a
Remove dead code error handling from dsl_crypt.c
Sleepable (KM_SLEEP) allocations cannot fail. Hence
error handling for them is not useful.

Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10031
2020-02-25 15:59:29 -08:00
Dirkjan Bussink
327000ce04
Remove zfs_getattr and convoff dead code
The `convoff` function is called only in one code path in `zfs_space`.
Each caller of `zfs_space` is called with a `flock64_t` that has
`l_whence` set to `SEEK_SET`. This means that `convoff` always results
in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and
`int whence` is `SEEK_SET` as well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes #10006
2020-02-24 15:38:22 -08:00
Matthew Ahrens
31a69fbccb
Remove unused structs and members in dmu_send.c
There are several structs (and members of structs) related to redaction,
which are no longer used.  This commit removes them.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10039
2020-02-24 09:50:14 -08:00
Ryan Moeller
5f087dda78
Enable zpool events tunables and tests on FreeBSD
We have have made the necessary changes in our module code to expose
zevents through both devd and the zpool events ioctl. Now the tunables
can be exposed and zpool events tests can be enabled on both platforms.

A few minor tweaks to the tests were needed to accommodate the way wc
formats output on FreeBSD.

zed remains to be ported.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10008
2020-02-18 11:22:56 -08:00
Matthew Macy
8b3547a481
Factor out some dbuf subroutines and add state change tracing
Create dedicated dbuf_read_hole and dbuf_read_bonus.
Additionally, add a dtrace probe to allow state change tracing.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Authored-by: Will Andrews <wca@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9923
2020-02-18 11:21:37 -08:00
Jason King
13b5a4d5c0
Support setting user properties in a channel program
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9950
2020-02-14 13:41:42 -08:00
Matthew Ahrens
4fe3a842bb
Remove limit on number of async zio_frees of non-dedup blocks
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG.  This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.

zfs_async_block_max_blocks is set to 100,000 by default.  For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy").  For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.

This change increases zfs_async_block_max_blocks to be unlimited by
default.  To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg.  The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10000
2020-02-14 08:39:46 -08:00
Alexander Motin
465e4e795e
Remove duplicate dbufs accounting
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.

According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #9949
2020-02-13 11:20:42 -08:00
Christian Schwarz
948f0c4419 zcp: add zfs.sync.bookmark
Add support for bookmark creation and cloning.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:17 -08:00
Christian Schwarz
a73f361fdb Implement bookmark copying
This feature allows copying existing bookmarks using

    zfs bookmark fs#target fs#newbookmark

There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.

Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).

ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.

Overview:

- Terminology:
    - source = existing snapshot or bookmark
    - new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
  - create new bookmark node
  - copy source's `zbn_phys` to new's `zbn_phys`
  - zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
  bookmarks as sources
  - => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
  and bookmark sources
- Adjust CLI
  - refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
  - warn about redaction bookmark handling
- Add test cases
  - CLI
  - pyyzfs libzfs_core bindings

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #9571
2020-02-11 13:19:12 -08:00
Matthew Macy
7b49bbc816
Address Coverity warnings in #9902
Coverity reports the variable may be NULL, but due to the
way the dirty records are handled this cannot be the case.
Add a comment and VERIFY to make this clear and silence
the warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #9962
2020-02-11 13:12:41 -08:00
Brian Behlendorf
dceeca5bbd
Add missing dmu_buf_unlock_parent() calls to dbuf_read_impl()
As explained by the comment in dbuf_read() and above dbuf_read_impl().
Under all circumstances the parent lock specified by dblt should be
dropped when existing dbuf_read_impl().  This was not being done for
two exist paths.  Additionally, ensure the mutex is unlocked before
dropping the parent lock.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9968
2020-02-10 14:54:12 -08:00