ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist
With a compression= property outside of the understood range:
Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.
After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10631
A collection of header changes to enable FreeBSD to build
with vendored OpenZFS.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10635
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10600
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10636
When a clone is promoted, its livelist is no longer accurate, so it is
discarded. If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.
Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B
If we promote C, it discards C's livelist. It should discard B's
livelist, but that is not happening. The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C. The incorrectly-freed blocks can be reallocated causing
checksum errors. And when C is destroyed it can double-free the
incorrectly-freed blocks.
The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir. So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed. As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10652
In `vdev_load()`, we look up several entries in the `vdev_top_zap`
object. In most cases, if we encounter an i/o error, it will be
returned to the caller. However, when handling
`VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue
on, which in theory could cause us to not realize that a vdev should be
used only for `special` allocations.
In practice, if we encountered an i/o error while looking for
`VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an
i/o error while looking for other entries in the same object, and thus
the zpool open/import would fail. Therefore the impact of this problem
is negligible.
This commit adds error handling for i/o errors while accessing the
`vdev_top_zap`, so that we aren't relying on unrelated code to fail for
us.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10637
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10620
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#10630
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10621
This is a step toward being able to vendor the OpenZFS code in FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10625
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10623
By design a gang ABD can not have another gang ABD as a child. This is
to make sure the logical offset in a gang ABD is consistent with the
individual ABDS it contains as children. If a gang ABD is added as a
child of a gang ABD we will add the individual children of the gang ABD
to the parent gang ABD. This allows for a consistent view of offsets
within the parent gang ABD.
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10430
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.
Recalculate the metadata cache size on demand so it can adapt, too.
Update descriptions in zfs-module-parameters(5).
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10563Closes#10610
The process of evicting data from the ARC is referred to as
`arc_adjust`.
This commit changes the term to `arc_evict`, which is more specific.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10592
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure. This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function. This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.
The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it. Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.
The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer. Therefore, additionally waking it from the
shrinker calbacks has little benefit.
The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second. Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times. Therefore, this mechanism consumes
significant amounts of CPU time.
The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
1. Return free objects from the magazine layer to the slab layer
2. Return entirely-free slabs to the page layer (i.e. free pages)
These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches. The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.
These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine. So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.
We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected. Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.
This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10576
Livelists and spacemaps are data structures that are logs of allocations
and frees. Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.
These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).
ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab. However, it
generally halts processing when the error is detected.
When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.
The specific checks added are:
Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]
Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
spacemap log, then report any double FREEs and double ALLOCs
Verify that livelists are consistenet with spacemaps. The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes#10515
Our QE team during automated API testing hit deadlock in ZFS, caused
by lock order reversal. From one side dsl_sync_task_sync() locks
dp_config_rwlock as writer and calls spa_sync_props(), which waits
for spa_props_lock. From another spa_prop_get() locks spa_props_lock
and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock
as reader.
This patch makes spa_prop_get() lock dp_config_rwlock before
spa_props_lock, making the order consistent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#10553
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.
Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10511
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit." Two types of users are allowed to change
the limit:
1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`. This works properly.
2. A user with elevated system privileges (e.g. root). This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.
The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process. This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.
This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged. Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.
This problem only impacts Linux. Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8226Closes#10545
In case l2arc_write_done() handles a zio that was not successful check
that the list of log block pointers is not empty when restoring them
in the device header. Otherwise zero them out. In any case perform the
actual write updating the device header after the zio of
l2arc_write_buffers() completes as l2arc_write_done() may have touched
the memory holding the log block pointers in the device header.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10540Closes#10543
zfs_rangelock_tryenter() bails immediately instead of waiting for the
lock to become available. This will be used to resolve a deadlock in
the FreeBSD page-in code. No functional change intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#10519
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices. However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10349
Fix header conflicts when building zfs with openzfs as a vendor import.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10497
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems. In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems. For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.
Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.
These changes do not impact system behavior, they are purely code
cleanup. The changes are:
Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.
In `arc_available_memory()`:
* `desfree` is unused, remove it
* rename `freemem` to avoid conflict with pre-existing `#define`
* remove checks related to arenas
* use units of bytes, rather than converting from bytes to pages and
then back to bytes
`SPL_KMEM_CACHE_REAP` is unused, remove it.
`skc_reap` is unused, remove it.
The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.
`vmem_size()` and associated type and macros are unused, remove them.
In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10499
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`. Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`. However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.
Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`. Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC. During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).
To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.
Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.
Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.
To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes#10496
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.
Add the handler so the tunable can be declared in common code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10490
Include the header with prototypes in the file that provides definitions
as well, to catch any mismatch between prototype and definition.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10470
Mark functions used only in the same translation unit as static. This
only includes functions that do not have a prototype in a header file
either.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10470
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.
Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10471
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10474
As it uses kmem_strdup() and kmem_strfree() which both rely on
strlen() being the same, but saved_poolname can be truncated causing:
SPL: kernel memory allocator:
buffer freed to wrong cache
SPL: buffer was allocated from kmem_alloc_16,
SPL: caller attempting free to kmem_alloc_8.
SPL: buffer=0xffffff90acc66a38 bufctl=0x0 cache: kmem_alloc_8
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10469
For at least 15 years since OpenSolaris arc_c was set by default to
arc_c_max, later decreased under memory pressure. I've noticed that
if arc_c was set high enough to cause memory pressure as considered
by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes
no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return. All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point when
OS low memory handler activates and reduces arc_c, requesting fast
reclamation of just allocated memory.
The problem seems to be more serious on FreeBSD and I guess Linux,
since neither of them implement/use asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time. On older FreeBSD 11 not
supporting multiple memory domains system with lots of RAM can get
completely unresponsive for minutes due to heavy lock congestion
between ARC reclamation and page daemon kmem reclamation threads.
With this change to more conservative arc_c value ARC stops growing
just it time and does not need later reclamation.
Also while there, since now growing arc_c is a more often situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock congestion. It is also getting in sync with code in
arc_get_data_impl().
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#10437
The macOS uio struct is opaque and the API must be used, this
makes the smallest changes to the code for all platforms.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10412
On macOS clock_t is unsigned, so when cv_timedwait_hires() returns -1
we loop forever. The conditional was tweaked to ignore signedness.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10445
The linux module can be built either as an external module, or compiled
into the kernel, using copy-builtin. The source and build directories
are slightly different between the two cases, and currently, compiling
into the kernel still refers to some files from the configured ZFS
source tree, instead of the copies inside the kernel source tree. There
is also duplication between copy-builtin, which creates a Kbuild file to
build ZFS inside the kernel tree, and the top-level module/Makefile.in.
Fix this by moving the list of modules and the CFLAGS settings into a
new module/Kbuild.in, which will be used by the kernel kbuild
infrastructure, and using KBUILD_EXTMOD to distinguish the two cases
within the Makefiles, in order to choose appropriate include
directories etc.
Module CFLAGS setting is simplified by using subdir-ccflags-y (available
since 2.6.30) to set them in the top-level Kbuild instead of each
individual module. The disabling of -Wunused-but-set-variable is removed
from the lua and zfs modules. The variable that the Makefile uses is
actually not defined, so this has no effect; and the warning has long
been disabled by the kernel Makefile itself.
The target_cpu definition in module/{zfs,zcommon} is removed as it was
replaced by use of CONFIG_SPARC64 in
commit 70835c5b75 ("Unify target_cpu handling")
os/linux/{spl,zfs} are removed from obj-m, as they are not modules in
themselves, but are included by the Makefile in the spl and zfs module
directories. The vestigial Makefiles in os and os/linux are removed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10379Closes#10421
Correct various typos in the comments and tests.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#10423
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#6224Closes#10383
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.
We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.
We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).
We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9713Closes#9789Closes#10224
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.
Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#10417
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used. If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().
The biggest change is adding a size parameter to zfs_id_to_fuidstr().
The various *_impl_get() functions are only used on linux and have
not yet been updated.
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10400
It was possible to cause a kernel panic in the send code by
initializing an already-initialized mutex, if a record was created
with type DATA, destroyed with a different type (bypassing the
mutex_destroy call) and then re-allocated as a DATA record again.
We tweak the logic to not change the type of a record once it has
been created, avoiding the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10374
FreeBSD needs arc_adjust_zthr to run periodically for kstats to be
updated. A comment in the code suggests this may have been the
original intent in illumos as well:
c946d5a913/module/zfs/arc.c (L4697-L4700)
Create the thread with a 1 second timer.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10371
The dsl_destroy_snapshots_nvl() function has an early error out,
and temporary nvlists were not freed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10366
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.
This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10069
Due to hotplug support or BIOS bugs sometimes max_ncpus can be
an absurdly high value. I have a system with 32 cores/threads
but reports max_ncpus == 440. This many threads potentially
cripples the system during arc_prune floods for example.
boot_ncpus is the number of working CPUs when called so use
that instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#10282
If `receive_writer_thread()` gets an error from `receive_process_record()`,
it should be saved in `rwa->err` so that we will stop processing records,
and the main thread will notice that the receive has failed.
When an error is first encountered, this happens correctly. However, if
there are more records to dequeue, the next time through the loop we
will reset `rwa->err` to zero, allowing us to try to process the
following record (2 after the failed record). Depending on what types
of records remain, we may incorrectly complete the receive
"successfully", but without actually having processed all the records.
The fix is to only set `rwa->err` if we got a *non-zero* error.
This bug was introduced by #10099 "Improve zfs receive performance by
batching writes".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10320
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections. On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving. By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.
Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10332
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10319
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.
In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.
A test has been added to verify that resilver does not restart when
relevant DTL's are available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#10291
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.
The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10293
Functional changes:
We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.
1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
reflect the total aligned size of log blocks on the device. This is
also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
which reflect the total number of L2ARC log blocks present on cache
devices. It is also reflected in the header of the cache device as
"dh_lb_count".
In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.
Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.
If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().
If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.
In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.
Non-functional changes:
Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.
Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.
Fix inaccurate comments for example in l2arc_log_blk_restore().
Add asserts at the end in l2arc_evict() and l2arc_write_buffers().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10228
Modern bootloaders leverage data stored in the root filesystem to
enable some of their powerful features. GRUB specifically has a grubenv
file which can store large amounts of configuration data that can be
read and written at boot time and during normal operation. This allows
sysadmins to configure useful features like automated failover after
failed boot attempts. Unfortunately, due to the Copy-on-Write nature
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store
data that allows the bootloader to make changes to the data.
This work is very similar to work that was done on Illumos to enable
similar functionality in the FreeBSD bootloader. This patch is different
in that the data being stored is a raw grubenv file; this file can store
arbitrary variables and values, and the scripting provided by grub is
powerful enough that special structures are not required to implement
advanced behavior.
We repurpose the second padding area in each label to store the grubenv
file, protected by an embedded checksum. We add two ioctls to get and
set this data, and libzfs_core and libzfs functions to access them more
easily. There are no direct command line interfaces to these functions;
these will be added directly to the bootloader utilities.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10009
When a top-level vdev is removed from a pool it is converted to an
indirect vdev. Until now splitting such mirrored pools was not possible
with zpool split. This patch enables handling of indirect vdevs and
splitting of those pools with zpool split.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10283
This patch corrects a bug introduced in 61152d1069. When
resuming a raw base receive, the dmu_recv code always sets
drc->drc_fromsnapobj to the object ID of the previous
snapshot. For incrementals, this is correct, but for base
sends, this should be left at 0. The presence of this ID
eventually allows a check to run which determines whether
or not the incoming stream and the previous snapshot have
matching IVset guids. This check fails becuase it is not
meant to run when there is no previous snapshot. When it
does fail, the user receives an error stating that the
incoming stream has the problem outlined in errata 4.
This patch corrects this issue by simply ensuring
drc->drc_fromsnapobj is left as 0 for base receives.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#10234Closes#10239
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated. Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.
This commit removes the code for sending and receiving deduplicated send
streams. `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream. `zfs receive` of
a deduplicated send stream will print an error message and fail.
The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.
Several new tests are added which leverage `zstream redup`.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156Closes#10212
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators". Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators. In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.
This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t. The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10213
The progress of a send is supposed to be reported by `zfs send -v`, but
it is not. This works by creating a new user thread (with
pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check how
much progress has been made. This IOCTL finds the specified send (since
there may be multiple concurrent sends in the system). The IOCTL also
checks that the specified send was started by the current process.
On Linux, different threads of the same process are represented as
different `struct task_struct`s (and, confusingly, have different
PID's). To check if if two threads are in the same process, we need to
check if they have the same `struct task_struct:group_leader`.
We used to to this correctly, but it was inadvertently changed by
30af21b025 (Redacted Send) to simply check if the current
`struct task_struct` is the one that started the send.
This commit changes the code back to checking if the send was started by
a `struct task_struct` with the same `group_leader` as the calling
thread.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10215Closes#10216
Minor fixes on persistent L2ARC improving code readability and fixing
a typo in zdb.c when byte-swapping a log block. It also improves the
pesist_l2arc_007_pos.ksh test by giving it more time to retrieve log
blocks on the cache device.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10210
Remove some obsolete legacy compat, rename some misnamed, and add some
missing tunables for FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10203
The memory and cpu cost of reference count tracking with the current
implementation is significant. For this reason it has always been
disabled by default for the kmods. Apply this same default to user
space so ztest doesn't always incur this performance penalty.
Our intention is to re-enable this by default for ztest once the code
has been optimized. Since we expect to at some point provide a FUSE
implementation we wouldn't want this enabled by default for libzpool.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10189
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#925Closes#1823Closes#2672Closes#3744Closes#9582
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.
Add warning messages when tunables are being ignored.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10157Closes#10158
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10178
Added to prevent a possible deadlock, the following comments from
FreeBSD explain the issue. The comment describing vn_io_fault_uiomove:
/*
* Helper function to perform the requested uiomove operation using
* the held pages for io->uio_iov[0].iov_base buffer instead of
* copyin/copyout. Access to the pages with uiomove_fromphys()
* instead of iov_base prevents page faults that could occur due to
* pmap_collect() invalidating the mapping created by
* vm_fault_quick_hold_pages(), or pageout daemon, page laundry or
* object cleanup revoking the write access from page mappings.
*
* Filesystems specified MNTK_NO_IOPF shall use vn_io_fault_uiomove()
* instead of plain uiomove().
*/
This used for vn_io_fault which has the following motivation:
/*
* The vn_io_fault() is a wrapper around vn_read() and vn_write() to
* prevent the following deadlock:
*
* Assume that the thread A reads from the vnode vp1 into userspace
* buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
* currently not resident, then system ends up with the call chain
* vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
* vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
* which establishes lock order vp1->vn_lock, then vp2->vn_lock.
* If, at the same time, thread B reads from vnode vp2 into buffer buf2
* backed by the pages of vnode vp1, and some page in buf2 is not
* resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
*
* To prevent the lock order reversal and deadlock, vn_io_fault() does
* not allow page faults to happen during VOP_READ() or VOP_WRITE().
* Instead, it first tries to do the whole range i/o with pagefaults
* disabled. If all pages in the i/o buffer are resident and mapped,
* VOP will succeed (ignoring the genuine filesystem errors).
* Otherwise, we get back EFAULT, and vn_io_fault() falls back to do
* i/o in chunks, with all pages in the chunk prefaulted and held
* using vm_fault_quick_hold_pages().
*
* Filesystems using this deadlock avoidance scheme should use the
* array of the held pages from uio, saved in the curthread->td_ma,
* instead of doing uiomove(). A helper function
* vn_io_fault_uiomove() converts uiomove request into
* uiomove_fromphys() over td_ma array.
*
* Since vnode locks do not cover the whole i/o anymore, rangelocks
* make the current i/o request atomic with respect to other i/os and
* truncations.
*/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10177
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.
To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.
With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10179
Add a mechanism to wait for delete queue to drain.
When doing redacted send/recv, many workflows involve deleting files
that contain sensitive data. Because of the way zfs handles file
deletions, snapshots taken quickly after a rm operation can sometimes
still contain the file in question, especially if the file is very
large. This can result in issues for redacted send/recv users who
expect the deleted files to be redacted in the send streams, and not
appear in their clones.
This change duplicates much of the zpool wait related logic into a
zfs wait command, which can be used to wait until the internal
deleteq has been drained. Additional wait activities may be added
in the future.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9707
Increasing l2arc_write_size or l2arc_write_boost can result in
l2arc_write_buffers() not having enough space to perform its writes and
panic zio_write_phys().
Instead of resetting l2ad_hand to l2ad_start at the end of
l2arc_write_buffers() and not taking into account a possible
user-mediated increase of l2arc_write_max, we do this in l2arc_evict(),
right after l2arc_write_size() has run. If there is not enough space to
evict (ie we will exceed l2ad_end) we evict to the end of the device,
reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate
l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making
sure in l2arc_write_size() that l2ad_start + size does not exceed
l2ad_end.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10154
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator. Other platforms
do not require the same logic.
Implement an arc_default_max() function to determine a default max ARC
size in platform code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10155
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10152
This change adds a separate return code to zfs_ioc_recv that is used
for incomplete streams, in addition to the existing return code for
streams that contain corruption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10122
For each WRITE record in the stream, `zfs receive` creates a DMU
transaction (`dmu_tx_create()`) and writes this block's data into the
object. If per-block overheads (as opposed to per-byte overheads)
dominate performance (as is often the case with small recordsize), the
per-dmu-transaction overheads can be significant. For example, in some
workloads the `receieve_writer` thread is 100% on CPU, and more than
half of its CPU time is in these per-tx routines (e.g.
dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit).
To improve performance of `zfs receive`, this commit batches WRITE
records which are to nearby offsets of the same object, and uses one DMU
transaction to write them all. By default the batch size is 1MB, which
for recordsize=8K reduces the number of DMU transactions by 128x for
full send streams (incrementals will depend on how "clumpy" the changed
blocks are).
This commit improves the performance of `dd if=stream | zfs recv`
from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10099
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9662Closes#10115
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.
Addition of special and just in case dedup classes fixes the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#10106Closes#8694
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:
refs = zfs_refcount_remove(&dn->dn_holds, tag);
mutex_exit(&dn->dn_mtx);
So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.
This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#7814Closes#10101
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_trans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:
__schedule at ffffffff81610f2e
schedule at ffffffff81611558
schedule_preempt_disabled at ffffffff8161184a
__mutex_lock at ffffffff816131e8
arc_buf_destroy at ffffffffa0bf37d7 [zfs]
dbuf_destroy at ffffffffa0bfa6fe [zfs]
dbuf_evict_one at ffffffffa0bfaa96 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
osd_object_delete at ffffffffa0b64ecc [osd_zfs]
lu_object_free at ffffffffa06d6a74 [obdclass]
lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
shrink_slab at ffffffff811ca9d8
shrink_node at ffffffff811cfd94
do_try_to_free_pages at ffffffff811cfe63
try_to_free_pages at ffffffff811d01c4
__alloc_pages_slowpath at ffffffff811be7f2
__alloc_pages_nodemask at ffffffff811bf3ed
new_slab at ffffffff81226304
___slab_alloc at ffffffff812272ab
__slab_alloc at ffffffff8122740c
kmem_cache_alloc at ffffffff81227578
spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
arc_read at ffffffffa0bf0924 [zfs]
dbuf_read at ffffffffa0bf9083 [zfs]
dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Roper <markroper@gmail.com>
Closes#9987
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10067
Also dprintf_bp() in case BLK_VERIFY_HALT of zfs_blkptr_verify_log()
since dprintf_bp() in zfs_blkptr_verify() will never be executed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Justin Keogh <commits@v6y.net>
Closes#10086
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#10071
__zio_execute() calls zio_taskq_member() to determine if we are running
in a zio interrupt taskq, in which case we may need to switch to
processing this zio in a zio issue taskq. The call to
zio_taskq_member() can become a performance bottleneck when we are
processing a high rate of zio's.
zio_taskq_member() calls taskq_member() on each of the zio interrupt
taskqs, of which there are 21. This is slow because each call to
taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
slow.
This commit improves the performance of zio_taskq_member() by having it
cache the value of tsd_get(taskq_tsd), reducing the number of those
calls to 1/21th of the current behavior.
In a test case running `zfs send -c >/dev/null` of a filesystem with
small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of
one CPU, and with this change it is reduced to 1.3%. Overall time to
perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000
blocks/sec).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10070
This function should only return "linux" on Linux.
Move the kernel part of the function out of common code.
Fix the tests for FreeBSD.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10079
FreeBSD has a somewhat more cumbersome locking and refcounting
protocol for the platform counterpart to znode. We need to not call
zrele on the passed zp, but do need to do so on any intermediate zp.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10075
By adding a zfs_file_private accessor to the common
interfaces and some extensions to FreeBSD platform
code it is now possible to share the implementations
for the aforementioned functions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10073
When "zfs destroy" is run, it completes quickly, and in the background
we locate the blocks to free and free them. This background activity
can be observed with `zpool get freeing` and `zpool wait -t free ...`.
This background activity is processed by a single thread (the spa_sync
thread) which calls zio_free() on each of the blocks to free. With even
modest storage performance, the CPU consumption of zio_free() can be the
performance bottleneck.
Performance of zio_free() can be improved by not actually creating a
zio_t in the common case (non-dedup, non-gang), instead calling
metaslab_free() directly. This avoids the CPU cost of allocating the
zio_t, and more importantly the cost of adding and later removing this
zio_t from the parent zio's child list.
The result is that performance of background freeing more than doubles,
from 0.6 million blocks per second to 1.3 million blocks per second.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10034
The following check currently occurs in three separate locations
in dbuf.c. This change consolidates those checks in to the
dbuf_alloc_arcbuf_from_arcbuf() function.
if (arc_is_encrypted(data)) {
...
} else if (compress_type != ZIO_COMPRESS_OFF) {
...
} else {
...
}
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10057
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired. Callers are now required to use the time64_t type.
Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code. They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.
Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.
Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10052Closes#10064
* Add dedicated donde_set_dirtyctx routine.
* Add empty dirty record on destroy assertion.
* Make much more extensive use of the SET_ERROR macro.
Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9924
Sleepable (KM_SLEEP) allocations cannot fail. Hence
error handling for them is not useful.
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10031
The `convoff` function is called only in one code path in `zfs_space`.
Each caller of `zfs_space` is called with a `flock64_t` that has
`l_whence` set to `SEEK_SET`. This means that `convoff` always results
in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and
`int whence` is `SEEK_SET` as well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes#10006
There are several structs (and members of structs) related to redaction,
which are no longer used. This commit removes them.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10039
We have have made the necessary changes in our module code to expose
zevents through both devd and the zpool events ioctl. Now the tunables
can be exposed and zpool events tests can be enabled on both platforms.
A few minor tweaks to the tests were needed to accommodate the way wc
formats output on FreeBSD.
zed remains to be ported.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10008
Create dedicated dbuf_read_hole and dbuf_read_bonus.
Additionally, add a dtrace probe to allow state change tracing.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Authored-by: Will Andrews <wca@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9923
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes#9950
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG. This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.
zfs_async_block_max_blocks is set to 100,000 by default. For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy"). For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.
This change increases zfs_async_block_max_blocks to be unlimited by
default. To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg. The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10000
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.
According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9949
Add support for bookmark creation and cloning.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9571
This feature allows copying existing bookmarks using
zfs bookmark fs#target fs#newbookmark
There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.
Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).
ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.
Overview:
- Terminology:
- source = existing snapshot or bookmark
- new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
- create new bookmark node
- copy source's `zbn_phys` to new's `zbn_phys`
- zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
bookmarks as sources
- => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
and bookmark sources
- Adjust CLI
- refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
- warn about redaction bookmark handling
- Add test cases
- CLI
- pyyzfs libzfs_core bindings
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9571
Coverity reports the variable may be NULL, but due to the
way the dirty records are handled this cannot be the case.
Add a comment and VERIFY to make this clear and silence
the warning.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9962
As explained by the comment in dbuf_read() and above dbuf_read_impl().
Under all circumstances the parent lock specified by dblt should be
dropped when existing dbuf_read_impl(). This was not being done for
two exist paths. Additionally, ensure the mutex is unlocked before
dropping the parent lock.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9968
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display. Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9640Closes#9729
We need to do the same thing to update all spas on any OS for these
tunables, so let's share the code.
While here let's match the types of the literals initializing the
variables with the type of the variable.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9964
Factor the portion of dbuf_sync_leaf() responsible for handling bonus
buffers out in to its own dbuf_sync_bonus() helper function.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9909
Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to
re-borrow after the flush. But since asc_borrowed and asc_delta are
accessed only while holding asc_lock, it makes no any sense to modify
as_lower_bound and as_upper_bound in multiple steps. Instead of that
the new code uses only 2 atomics in all the cases, one per as_*_bound
variable. I think even that is overkill, simple atomic store and
load could be used here, since all modifications are done under the
as_lock, but there are no such primitives in ZFS code now.
While there, make borrow code consider previous borrow value, so that
on mixed request patterns reduce chance of needing to borrow again if
much larger request follows tiny one that needed borrow.
Also reduce as_numbuckets from uint64_t to u_int. It makes no sense
to use so large division operation on every aggsum_add().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9930
Move db_link into the same cache line as db_blkid and db_level.
It allows significantly reduce avl_add() time in dbuf_create() on
systems with large RAM and huge number of dbufs per dnode.
Avoid few accesses to dbuf_caches[].size, which is highly congested
under high IOPS and never stays in cache for a long time. Use local
value we are receiving from zfs_refcount_add_many() any way.
Remove cache_size_bytes_max bump from dbuf_evict_one(). I don't see
a point to do it on dbuf eviction after we done it on insertion in
dbuf_rele_and_unlock().
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9931
Additionally pull in state machine comments about
upcoming async cow work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9902
It violated sequence described in kstat.h, and at least on FreeBSD
kstat_install() uses provided names to create the sysctls. If the
names are not available at the time, it ends up bad.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9933
Clang warns (errors) that "cast from 'const void *' to 'struct v *'
drops const qualifier."
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9917
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes#7863Closes#9885
Discovered in preparation of zcp support for creating bookmarks.
Handle the case where dbca_errors is NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9880
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.
Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9539
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9867
Now that the FreeBSD zfs_vnops code avoids asserting that
a vnode lock is held when z_replay is true we can limit
the FreeBSD specific changes to the couple of changes
where it is necessary to drop the vnode locks because
a function returns with it held.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9865
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes#9738
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9861
With recent SPL changes there is no longer any need for a per
platform version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9860
Over the years several slightly different approaches were used
in the Makefiles to determine the target architecture. This
change updates both the build system and Makefile to handle
this in a consistent fashion.
TARGET_CPU is set to i386, x86_64, powerpc, aarch6 or sparc64
and made available in the Makefiles to be used as appropriate.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9848
Currently, the handling for errata #4 has two issues which allow
the checks for this issue to be bypassed using resumable sends.
The first issue is that drc->drc_fromsnapobj is not set in the
resuming code as it is in the non-resuming code. This causes
dsl_crypto_recv_key_check() to skip its checks for the
from_ivset_guid. The second issue is that resumable sends do not
clean up their on-disk state if they fail the checks in
dmu_recv_stream() that happen before any data is received.
As a result of these two bugs, a user can attempt a resumable send
of a dataset without a from_ivset_guid. This will fail the initial
dmu_recv_stream() checks, leaving a valid resume state. The send
can then be resumed, which skips those checks, allowing the receive
to be completed.
This commit fixes these issues by setting drc->drc_fromsnapobj in
the resuming receive path and by ensuring that resumablereceives
are properly cleaned up if they fail the initial dmu_recv_stream()
checks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9818Closes#9829
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9007
For dedup, special and log devices "zpool add -n" does not print
correctly their vdev type:
~# zpool add -n pool dedup /tmp/dedup special /tmp/special log /tmp/log
would update 'pool' to the following configuration:
pool
/tmp/normal
/tmp/dedup
/tmp/special
/tmp/log
This could lead storage administrators to modify their ZFS pools to
unexpected and unintended vdev configurations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9783Closes#9390
- Skip invalid DVAs when importing pools in readonly mode
(in addition to when the config is untrusted).
- Upon encountering a DVA with a null VDEV, fail gracefully
instead of panicking with a NULL pointer dereference.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Steve Mokris <smokris@softpixel.com>
Closes#9022
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8602Closes#9751
The dnp argument can only be set to NULL when the DNODE_DRY_RUN flag
is set. In which case, an early return path will be executed and a
NULL pointer dereference at the given location is impossible. Add
an additional ASSERT to silence the cppcheck warning and document
that dbp must never be NULL at the point in the function.
[module/zfs/dnode.c:1566]: (warning) Possible null pointer deref: dnp
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9732
The NEON code replicates too closely the SSE code, including
a masked 16-bits shift. But NEON, like AltiVec (#9539), has
unsigned 8-bits shift, so use that instead and drop the masking.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9725
Explain FreeBSD VFS' unfortunate idiosyncratic locking requirements.
There is no functional change for other platforms.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9720
Currently, 'zfs list' and 'zfs get' commands can be slow when
working with snapshots that have a ds_props_obj. This is
because the code that discovers all of the properties for these
snapshots needs to read this object for each snapshot, which
almost always ends up causing an extra random synchronous read
for each snapshot. This performance penalty exists even if the
properties on that snapshot have been unset because the object
is normally only freed when the snapshot is freed, even though
it is only created when it is needed.
This patch allows the user to regain 'zfs list' performance on
these snapshots by destroying the ds_props_obj when it no longer
has any entries left. In practice on a production machine, this
optimization seems to make 'zfs list' about 55% faster.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9704
FreeBSD's vfs currently doesn't permit file systems
to do their own locking. To avoid having to have
duplicate zfs functions with and without locking add
locking here. With luck these changes can be removed
in the future.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9715
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed). Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().
Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.
Found by AddressSanitizer:
ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
#0 0x7fe88b0c166d (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
#1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
#2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
#3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
#4 0x55ffbc769fba in ztest_execute ztest.c:6714
#5 0x55ffbc779a90 in ztest_thread ztest.c:6761
#6 0x7fe889cbc6da in start_thread
#7 0x7fe8899e588e in __clone
0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
#0 0x7fe88b14e7b8 in __interceptor_free
#1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
#2 0x7fe88ae543ba in nvpair_free nvpair.c:844
#3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
#4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
#5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
#6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
#7 0x55ffbc769fba in ztest_execute ztest.c:6714
#8 0x55ffbc779a90 in ztest_thread ztest.c:6761
#9 0x7fe889cbc6da in start_thread
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9706
The quota functions are common to all implementations and can be
moved to common code. As a simplification they were moved to the
Linux platform code in the initial refactoring.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9710
Add the 'zfs jail/unjail' subcommands along with the relevant
documentation from FreeBSD. This feature is not supported on
Linux and still requires the match kernel ioctls which will
be included when the FreeBSD platform code is integrated.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9686
Change many of the znops routines to take a znode rather
than an inode so that zfs_replay code can be largely shared
and in the future the much of the znops code may be shared.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9708
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail. Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9612Closes#9630
The zfsvfs->z_sb field is Linux specified and should be abstracted.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9697
This change allows us to align the code dump logic across platforms.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9691
The dsl_dataset_deactivate_feature_impl() function is private and
should be marked as such.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9696
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9664
Update zfs_deadman_failmode to use the ZFS_MODULE_PARAM_CALL
wrapper, and split the common and platform specific portions.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9670
Remove the ASSERTV macro and handle suppressing unused
compiler warnings for variables only in ASSERTs using the
__attribute__((unused)) compiler annotation. The annotation
is understood by both gcc and clang.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9671
In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device. Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read try to bump the ZIO priority, it will crash.
The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9648
The module_param_call() functionality is currently still
Linux-specific and should be wrapped accordingly.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9666
The write_record() function is private and should be marked as such.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9665
FreeBSD needs to cope with multiple version of the zfs_cmd_t
structure. Allowing the platform code to pre and post
process the cmd structure makes it possible to work with
legacy tooling.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9624
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840Closes#9155Closes#9378Closes#9551Closes#9588
Provide a common zfs_file_* interface which can be implemented on all
platforms to perform normal file access from either the kernel module
or the libzpool library.
This allows all non-portable vnode_t usage in the common code to be
replaced by the new portable zfs_file_t. The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL. Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9556
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it
became ~127KB+/1KB+ in two 128KB log blocks to free space in the second
block for another record. Unfortunately in case of 128KB only writes,
when space in the second block remained unused, that change increased
write latency by unbalancing checksum computation and write times
between parallel threads. It also didn't help with SLOG space
efficiency in that case.
This change introduces new 68KB log block size, used for both writes
below 67KB and 128KB-sharp writes. Writes of 68-127KB are still using
one 128KB block to not increase processing overhead. Writes above
131KB are still using full 128KB blocks, since possible saving there
is small. Mixed loads will likely also fall back to previous 128KB,
since code uses maximum of the last 16 requested block sizes.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#9409
Some of the znode fields are different and functions
consuming an inode don't exist on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9536
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#9525
A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9534
Address two prototype related warnings emitted by clang.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9535
Move these Linux module parameter get/set helpers in to
platform specific code.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9457
Currently, when you call 'zfs change-key' on an encrypted dataset
that has an unencrypted child, the code will trigger a VERIFY.
This VERIFY is leftover from before we allowed unencrypted
datasets to exist underneath encrypted ones. This patch fixes the
issue by simply replacing the VERIFY with an early return when
recursing through datasets.
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9524
- FreeBSD's rootpool import code uses spa_config_parse
- Move the zvol_create_minors call out from under the
spa_namespace_lock in spa_import. It isn't needed and it causes
a lock order reversal on FreeBSD.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9499
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.
A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9487Closes#9489
This assert makes non portable assumptions about the state of memory
returned by the memory allocator.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9506
This logic is not platform dependent and should reside in the
common code.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9505
It's mostly a noop on ZoL and it conflicts with platforms that
support dtrace. Remove this header to resolve the conflict.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9497
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.
This has the additional advantage of allowing us to use the FPU
again in user threads. So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346Closes#9403
Fixes an obvious issue of calling arc_buf_destroy() on an
unallocated arc_buf.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#9453
Factor Linux specific memory pressure handling out of ARC. Each
platform will have different available interfaces for managing memory
pressure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9472
Only pass the file descriptor to make zfsdev_get_miror() portable.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9466
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9465
Clang will complain if a function has no prior declaration
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9467
This addresses a number of problems with dmu_send.c:
* bp_span is unused which makes clang complain
* dump_write conflicts with FreeBSD's existing core dump code
* range_alloc is private to the file and not declared in any headers
causing clang to complain
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9432
We get the sizeof the appropriate type, and don't cast away const.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9455
FreeBSD has a very different implementation.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9442
FreeBSD has its own implementation as do other platforms.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9439
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#9452
Rename certain functions for more consistency when they share common
features. Make comments clearer about what arguments should be passed
to the insert and add functions.
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9441
FreeBSD uses this in its pager ops routines
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9431
When "feature@allocation_classes" is not enabled on the pool no vdev
with "special" or "dedup" allocation type should be allowed to exist in
the vdev tree.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9427Closes#9429
Temporary property handling at the VFS layer requires
platform specific code.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9401
Make the metaslab platform agnostic again by adding
accessor functions which can be implemented by each
platform.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9404
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9440
In the FreeBSD kernel the strdup signature is:
```
char *strdup(const char *__restrict, struct malloc_type *);
```
It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9433
The macros are used to generate code for conditions without a
corresponding branch. This is not a problem in practice, but
clang has no way of knowing that. Add a default branch with a
VERIFY(0) to indicate that it "can't happen"
```
In file included from \
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_sse2.c:607:
/usr/home/mmacy/devel/ZoF/module/zfs/vdev_raidz_math_impl.h:281:3: \
error: no case matching constant switch condition '3' [-Werror]
```
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9434
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9181
A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9402
Make arc_stats visible to platform code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9386
We've seen cases where after creating a ZVOL, the ZVOL device node in
"/dev" isn't generated after 20 seconds of waiting, which is the point
at which our applications gives up on waiting and reports an error.
The workload when this occurs is to "refresh" 400+ ZVOLs roughly at the
same time, based on a policy set by the user. This refresh operation
will destroy the ZVOL, and re-create it based on a snapshot.
When this occurs, we see many hundreds of entries on the "z_zvol" taskq
(based on inspection of the /proc/spl/taskq-all file). Many of the
entries on the taskq end up in the "zvol_remove_minors_impl" function,
and I've measured the latency of that function:
Function = zvol_remove_minors_impl
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 45 |****************************************|
256 -> 511 : 5 |**** |
That data is from a 10 second sample, using the BCC "funclatency" tool.
As we can see, in this 10 second sample, most calls took 128ms at a
minimum. Thus, some basic math tells us that in any 20 second interval,
we could only process at most about 150 removals, which is much less
than the 400+ that'll occur based on the workload.
As a result of this, and since all ZVOL minor operations will go through
the single threaded "z_zvol" taskq, the latency for creating a single
ZVOL device can be unreasonably large due to other ZVOL activity on the
system. In our case, it's large enough to cause the application to
generate an error and fail the operation.
When profiling the "zvol_remove_minors_impl" function, I saw that most
of the time in the function was spent off-cpu, blocked in the function
"taskq_wait_outstanding". How this works, is "zvol_remove_minors_impl"
will dispatch calls to "zvol_free" using the "system_taskq", and then
the "taskq_wait_outstanding" function is used to wait for all of those
dispatched calls to occur before "zvol_remove_minors_impl" will return.
As far as I can tell, "zvol_remove_minors_impl" doesn't necessarily have
to wait for all calls to "zvol_free" to occur before it returns. Thus,
this change removes the call to "taskq_wait_oustanding", so that calls
to "zvol_free" don't affect the latency of "zvol_remove_minors_impl".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#9380
Refactor the zfs ioctls in to platform dependent and independent bits.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9301
Refactor the zvol in to platform dependent and independent bits.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9295
Trying to 'zfs diff' a snapshot with large dnodes will incorrectly try
to access its interior slots when dnodesize > sizeof(dnode_phys_t).
This is normally not an issue because the interior slots are
zero-filled, which report_dnode() handles calling
report_free_dnode_range(). However this is not the case for encrypted
large dnodes or filesystem using many SA based xattrs where the extra
data past the legacy dnode size boundary is interpreted as a
dnode_phys_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7678Closes#8931Closes#9343
When a disk is replaced with another on a pool with the resilver_defer
feature present, but not enabled the resilver activity restarts during
each spa_sync. This patch checks to make sure that the resilver_defer
feature is first enabled before requesting a deferred resilver.
This was originally fixed in illumos-joyent as OS-7982.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes#9299Closes#9338
The was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.
In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued. A duplicate insertion would be attempted and
its failure would result in a panic.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9140Closes#9163
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9327
This commit fixes the following build failure detected on Debian9
(GCC 6.3.0):
CC [M] module/zfs/spa.o
module/zfs/spa.c: In function ‘spa_wait_common.part.31’:
module/zfs/spa.c:9468:6: error: ‘in_progress’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
if (!in_progress || spa->spa_waiters_cancel || error)
^
cc1: all warnings being treated as errors
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9326
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9267Closes#9294
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#9162
objnode is OS agnostic and used only by dmu_redact.c.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9315
Move Linux specific tracing headers and source to platform directories
and update the build system.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9290
Currently, the DSL scan code figures out when it should suspend
processing and allow a txg to continue by calling the function
dsl_scan_check_suspend(). Unfortunately, this function only
allows the scan to suspend at a level 0 block. In the event that
the system is scanning a bunch of empty snapshots or a resilver
is running with a high enough scn_cur_min_txg, the scan will
stop processing each dataset at the root level, deciding it
has nothing left to do. This means that the check_suspend
function is never called and the txg remains stuck until a
dataset is found that has data to scan.
This patch fixes the problem by allowing scans to suspend at
the root level of the objset. For backwards compatibility, we
use the bookmark <objsetid, 0, 0, 0> when we suspend here so
that older versions of the code will work as intended.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9300
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify(). A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool. This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer. For example, to perform a
`zpool attach` as shown in the abbreviated stack below.
To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property. The cached
value is then relied upon for subsequent accesses.
Call Trace:
spa_config_enter+0x1e8/0x350 [zfs]
zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
zio_read+0x6c/0x140 [zfs]
...
vfs_read+0xfc/0x1e0
kernel_read+0x50/0x90
...
spa_get_hostid+0x1c/0x38 [zfs]
spa_config_generate+0x1a0/0x610 [zfs]
vdev_label_init+0xa0/0xc80 [zfs]
vdev_create+0x98/0xe0 [zfs]
spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9256Closes#9285
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context. This was done
intentionally since it's a relatively infrequent operation. However,
this also meant that the encryption context templates were initialized
using the generic operations. Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.
Resolve the issue by initializing the context templates in an I/O
pipeline thread. And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context. For example, when performing a read from the ARC.
Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9215Closes#9296
Move platform specific Linux source under module/os/linux/
and update the build system accordingly. Additional code
restructuring will follow to make the common code fully
portable.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Closes#9206
Adds ZFS_MODULE_PARAM to abstract module parameter
setting to operating systems other than Linux.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9230
`metaslab_verify_weight_and_frag()` a verification function and
by the end of it there shouldn't be any side-effects.
The function calls `metaslab_weight()` which in turn calls
`metaslab_set_fragmentation()`. The latter can dirty and otherwise
not dirty metaslab fro the next TXGand set `metaslab_condense_wanted`
if the spacemaps were just upgraded (meaning we just enabled the
SPACEMAP_HISTOGRAM feature through upgrade).
This patch adds a new flag as a parameter to `metaslab_weight()` and
`metaslab_set_fragmentation()` making the dirtying of the metaslab
optional.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9185Closes#9282
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.
As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context. No functional change, the source has been
refactored and the relevant #include's updated.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9198
Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value
is ignored in the Linux kstat code but resolves the issue for other
platforms.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9254Closes#9151
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).
Additionally, distinguish incremental and resume streams in error
messages.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9252
This change use the compat code introduced in 9cc1844a.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9268Closes#9269
When running on larger memory systems, we can overflow the value of
maxinflight. This can result in maxinflight having a value of 0 causing
the system to hang.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#9272
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9240
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9185Closes#9186Closes#9231Closes#9253
With the other metaslab changes loaded onto a system, we can
significantly reduce the memory usage of each loaded metaslab and
unload them on demand if there is memory pressure. However, none
of those changes actually result in us keeping more metaslabs loaded.
If we don't keep more metaslabs loaded, we will still have to wait
for demand-loading to finish when no loaded metaslab can satisfy our
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while
unloading is currently done on a txg-count basis. Txgs can take a
widely varying range of times, from tenths of a second to several
seconds. This can result in confusing, hard to predict behavior.
This change simply adds a time-based component to metaslab unloading.
A metaslab will remain loaded for one minute and 8 txgs (by default)
after it was last used, unless it is evicted due to memory pressure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes#9197
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.
This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.
Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
and smaller gains as thread count increases. At >64 threads, 2-5%
decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
leveling out around 64 threads. At >64 threads, 5-10% decrease in
performance was observed.
- 128K sequential write sees 1-5 for the 128K. No observed regression at
high thread count.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#9217
Tag the ABD data pages so that they can be identified for exclusion
from kernel crash dumps. Eliminating the zfs file data allows for
significantly smaller crash dump files. Note that ZFS in illumos has
always excluded the zfs data pages from a kernel crash dump.
This change tags ARC scatter data pages so they can be identified from
the makedumpfile(8) command. That command is used to create smaller
dump files by ignoring some memory regions and using compression. It
already filters file data from the VFS page cache and will now be able
to exclude ZFS file data pages from the dump file.
A corresponding change to makeumpfile(8) is required to identify ZFS
data pages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#8899
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.
We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7151Closes#8910Closes#9123Closes#9145
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes#9179Closes#9180
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.
This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9203
Fix some switch() fall-though compiler errors:
abd.c:1504:9: error: this statement may fall through
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#9170
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing. With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.
By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes#8744
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
zap_lockdir_impl+0x858/0x16c0 [zfs]
zap_lockdir+0xd2/0x100 [zfs]
zap_lookup_norm+0x7f/0x100 [zfs]
zap_lookup+0x12/0x20 [zfs]
sa_setup+0x902/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
and thread 2 trying to get sa_lock, either in sa_setup:
sa_setup+0x742/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
sa_build_index+0x13d/0x790 [zfs]
sa_handle_get_from_db+0x368/0x500 [zfs]
zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
zfs_znode_alloc+0x3da/0x1a40 [zfs]
zfs_zget+0x39a/0x6e0 [zfs]
zfs_root+0x101/0x160 [zfs]
zfs_domount+0x91f/0xea0 [zfs]
From there, there are different locking paths back to something
holding zap->zap_rwlock.
The deadlock scenarios involve multiple different ZFS filesystems
being mounted. sa_lock is common to these scenarios, and the sa
struct involved is private to a mount. Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.
The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#9110
On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.
First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs. We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9128
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes#9137
In zfs_log_write(), we can use dmu_read_by_dnode() rather than
dmu_read() thus avoiding unnecessary dnode_hold() calls.
We get a 2-5% performance gain for large sequential_writes tests, >=128K
writes to files with recordsize=8K.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#9156
This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.
This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().
A side to this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8348
The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9115
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run
into a race condition where that vdev is temporarily unavailable. This
happens when a device removal operation and the old vdev_t has been
removed from the array, but the new indirect vdev has not yet been
inserted.
We hold the spa_config_lock while doing our sensitive verification.
To ensure that we don't deadlock, we only grab the lock if we don't
have config_writer held. In addition, I had to const the tags of the
refcounts and the spa_config_lock arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9112
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.
We fix this by only calling zil_remove_async when the file is fully
unlinked.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#8769Closes#9061
This function is not used outside of dsl_dataset.c
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes#9154
When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue too many I/Os that it consumes all the memory
of the system resulting in an OOM message or a hang.
To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes#9146
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9092
Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor-refactoring was also
applied to distinguish between user and system properties
in ZCP.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com>
OpenZFS-issue: https://illumos.org/issues/8862Closes#7902
Beside the whole commit being a nit in reality it should
bring the diffs of the spa_log_spacemap.c source file
between ZoL and delphix/zfs to 0.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9143
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9055
ZED can prevent CPU's from properly sleeping.
Rather than periodically waking up in the zevents code, just go to sleep and wait for a wakeup.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#9091
This fixes a lockdep warning by breaking a link between ->tx_sync_lock
and ->dp_lock.
The deadlock envisioned by lockdep is this:
thread 1 holds db->db_mtx and tries to get dp->dp_lock:
dsl_pool_dirty_space+0x70/0x2d0 [zfs]
dbuf_dirty+0x778/0x31d0 [zfs]
thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
dmu_buf_will_dirty_impl
dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
bpobj_iterate_impl+0xbe6/0x1410 [zfs]
thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
bpobj_space+0x63/0x470 [zfs]
dsl_scan_active+0x340/0x3d0 [zfs]
txg_sync_thread+0x3f2/0x1370 [zfs]
thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock
txg_kick+0x61/0x420 [zfs]
dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]
This patch is orginally from Brian Behlendorf and slightly simplified
by me.
It breaks this cycle in thread 4 by moving the call from
dsl_pool_need_dirty_delay to txg_kick outside the section controlled
by dp->dp_lock.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#9094
In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve()
to return errors other than the expected ENOENT (e.g. when we are at
the end of the zap). Ensure that these error cases are handled
correctly by the import path.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9074
When the log spacemap commit was merged in ZoL, the
metaslab_verify_unflushed_changes() debugging function
was deleted as the feature was pretty much stable by
then. Unfortunately though there was a reference to
it from a comment in metaslab_verify_weight_and_frag().
This patch deletes the reference and pastes that
comment as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9097
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand. This eliminates some calls to dnode_hold(), which can be
expensive. This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.
We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.
This also removes some unnecessary casts that are in the area.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9081
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.
https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9035Closes#9043
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.
This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.
We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes#8416
Cast to uintptr_t first for portability on integer to/from pointer
conversion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9065
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex. Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9062
metaslab_should_allocate() is used in two places:
[1] When trying to select a metaslab to allocate from
[2] When trying to allocate from a metaslab
In [2] we always expect the metaslab to be loaded, and after
the refactoring of the log spacemap changes, whenever we load
a metaslab we set ms_max_size to the biggest range in the
ms_allocatable tree. Thus, when it is used in [2], if that
field is 0, it means that the metaslab doesn't have any
segments that can be used for allocations now (though it may
have some free space but that space can be in the freeing,
freed, or deferred trees).
In [1] a metaslab can be loaded or unloaded at which point 0
can either mean the metaslab doesn't have any space or the
metaslab is just not loaded thus we go ahead and try to make
an estimation based on its weight.
The issue here is when we call the above function for [2] and
the metaslab doesn't have any allocatable space, we still go
ahead and check its ms_weight which may be out of date because
we haven't ran metaslab_sync_done() yet. At that point we are
allowing an allocation to be attempted even though we know
there is no range that is allocatable.
This patch fixes this issue by explicitly checking if the
metaslab is loaded and if it is, the ms_max_size is used.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9045
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).
This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9015Closes#9044
There exists a race condition were hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:
[1] hdr_recl() runs first and goes intro zthr_wakeup()
because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
that flag, destroying the zthr, and goes into
buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
and we blow up.
This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9047
These aren't tunable; illumos has this comment fixed in
"3742 zfs comments need cleaner, more consistent style",
so sync with that.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9052
lockdep reports a possible recursive lock in dbuf_destroy.
It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode. However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.
This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.
In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.
This is the lockdep report:
[ 122.950921] ============================================
[ 122.950921] WARNING: possible recursive locking detected
[ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O
[ 122.950921] --------------------------------------------
[ 122.950921] dbu_evict/1457 is trying to acquire lock:
[ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921]
but task is already holding lock:
[ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
other info that might help us debug this:
[ 122.950921] Possible unsafe locking scenario:
[ 122.950921] CPU0
[ 122.950921] ----
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921]
*** DEADLOCK ***
[ 122.950921] May be due to missing lock nesting notation
[ 122.950921] 1 lock held by dbu_evict/1457:
[ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
stack backtrace:
[ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1
[ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009
[ 122.950921] Call Trace:
[ 122.950921] dump_stack+0x91/0xeb
[ 122.950921] __lock_acquire+0x2ca7/0x4f10
[ 122.950921] lock_acquire+0x153/0x330
[ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs]
[ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs]
[ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs]
[ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs]
[ 122.950921] taskq_thread+0x979/0x1480 [spl]
[ 122.950921] kthread+0x2e7/0x3e0
[ 122.950921] ret_from_fork+0x27/0x50
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#8984
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9039
The cast of the size_t returned by strlcpy() to a uint64_t by the
VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
is set. This is due to the additional hardening. Since the token
is expected to always fit in strval the VERIFY3U has been removed.
If somehow it doesn't, it will still be safely truncated.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8999Closes#9020
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe
Change target size of metaslabs from 256GB to 16GB
c853f382db
zdb -L should skip leak detection altogether
21e7cf5da8
vs_alloc can underflow in L2ARC vdevs
7558997d2f
Simplify log vdev removal code
6c926f426a
Get rid of space_map_update() for ms_synced_length
425d3237ee
Introduce auxiliary metaslab histograms
928e8ad47d
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8442