When a rebuild completes it will automatically schedule a follow up
scrub to verify all of the block checksums. Before setting up the
scrub execute the counterpart dsl_scan_setup_check() function to
confirm the scrub can be started. Prior to this change we'd only
check vdev_rebuild_active() which isn't as comprehensive, and using
the check function keeps all of this logic in one place.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11849
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.
Ratelimit deadman zevents according to the same tunable as for delay
zevents.
Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11786
Correct an assortment of typos throughout the code base.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11774
For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the
gang DVA including its children BP's. The space allocated at each DVA's
vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.
This commit makes this relationship more clear by using a helper
function, `vdev_gang_header_asize()`, for the space allocated at the
gang block's vdev/offset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11744
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11775
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random. But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.
This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run(). The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued. It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.
For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients. Benefits of those ways
varied, but neither was perfect. With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.
While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock. Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.
Delete prefetch streams when they reach ends of files. It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.
Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache. First cache miss immediately fires all the prefetch
that would be done for the stream by that time. It saves some CPU
time if same files within DMU cache capacity are read over and over.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#11652
If TX_WRITE is create on a file, and the file is later deleted and a new
directory is created on the same object id, it is possible that when
zil_commit happens, zfs_get_data will be called on the new directory.
This may result in panic as it tries to do range lock.
This patch fixes this issue by record the generation number during
zfs_log_write, so zfs_get_data can check if the object is valid.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#10593Closes#11682
The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs. Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not. In
other words, these are instances of silent data corruption. The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).
Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.
RAIDZ/DRAID generate checksum errors via three code paths:
1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction. This is handled in `raidz_reconstruct()`. This is the
most common type of RAIDZ/DRAID checksum error.
2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost. The zio fails and an error is
returned to the consumer (e.g. the read(2) system call). This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged. Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read. (For DRAID,
typically only one "group" of children is involved in each io.) This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.
3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happens to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child. This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity. This case
happens very rarely, since normally only metadata has ditto blocks. If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.
The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:
1. It needs to run after the main raidz read logic has completed. The
data RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.
2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data. For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.
Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3. These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11735
The `rr_code` field in `raidz_row_t` is unused.
This commit removes the field, as well as the code that's used to set
it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11736
Resolve some oddities in zfsdev_close() which could result in a
panic and were not present in the equivalent function for Linux.
- Remove unused definition ZFS_MIN_MINOR
- FreeBSD: Simplify zfsdev state destruction
- Assert zs_minor is valid in zfsdev_close
- Make locking around zfsdev state match Linux
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11720
For some reason cppcheck 1.90 is generating an invalidSyntax warning
when the BF64_SET macro is used in the zstream source. The same
warning is not reported by cppcheck 2.3, nor is their any evident
problem with the expanded macro. This appears to be an issue with
this version of cppcheck. This commit annotates the source to suppress
the warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11700
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#11023
Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11622
Also fixes leak of the dlopen handle in the error case.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#11602
Fix regression seen in issue #11545 where checksum errors
where not being counted or showing up in a zpool event.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#11609
Property to allow sets of features to be specified; for compatibility
with specific versions / releases / external systems. Influences
the behavior of 'zpool upgrade' and 'zpool create'. Initial man
page changes and test cases included.
Brief synopsis:
zpool create -o compatibility=off|legacy|file[,file...] pool vdev...
compatibility = off : disable compatibility mode (enable all features)
compatibility = legacy : request that no features be enabled
compatibility = file[,file...] : read features from specified files.
Only features present in *all* files will be enabled on the
resulting pool. Filenames may be absolute, or relative to
/etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc
checked first).
Only affects zpool create, zpool upgrade and zpool status.
ABI changes in libzfs:
* New function "zpool_load_compat" to load and parse compat sets.
* Add "zpool_compat_status_t" typedef for compatibility parse status.
* Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum
* Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum
An initial set of base compatibility sets are included in
cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is
modified to install these in $pkgdatadir/compatibility.d and to
create symbolic links to a reasonable set of aliases.
Reviewed-by: ericloewe
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes#11468
FreeBSD's zfsd fails to build after e2af2acce3 due to strict type
checking errors from the implicit conversion between bool and boolean_t
in the inline predicate definitions in abd.h.
Use conditionals to return the correct value type from these functions.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#11592
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes#11580
ABD's currently track their parent/child relationship. This applies to
`abd_get_offset()` and `abd_borrow_buf()`. However, nothing depends on
knowing this relationship, it's only used for consistency checks to
verify that we are not destroying an ABD that's still in use. When we
are creating/destroying ABD's frequently, the performance impact of
maintaining these data structures (in particular the atomic
increment/decrement operations) can be measurable.
This commit removes this verification code on production builds, but
keeps it when ZFS_DEBUG is set.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11535
The runtime of vdev_validate is dominated by the disk accesses in
vdev_label_read_config. Speed it up by validating all vdevs in
parallel using a taskq.
Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#11470
metaslab_init is the slowest part of importing a mature pool, and it
must be repeated hundreds of times for each top-level vdev. But its
speed is dominated by a few serialized disk accesses. That can lead to
import times of > 1 hour for pools with many top-level vdevs on spinny
disks.
Speed up the import by using a taskqueue to parallelize vdev_load across
all top-level vdevs.
This also requires adding mutex protection to
metaslab_class_t.mc_historgram. The mc_histogram fields were
unprotected when that code was first written in "Illumos 4976-4984 -
metaslab improvements" (OpenZFS
f3a7f6610f). The lock wasn't added until
3dfb57a35e, though it's unclear exactly
which fields it's supposed to protect. In any case, it wasn't until
vdev_load was parallelized that any code attempted concurrent access to
those fields.
Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#11470
Mixing ZIL and normal allocations has several problems:
1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed. This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.
2. When under moderate load, ZIL allocations are of 128KB. If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more. The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC. All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading. This can cause a significant performance
impact.
3. If the pool is very fragmented, there may be zero free chunks of
128KB or more. In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.
These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.
This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class). From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.
Log (ZIL) blocks can be allocated from the following locations. Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes. On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11389
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.
Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11438
The `abd_get_offset_*()` routines create an abd_t that references
another abd_t, and doesn't allocate any pages/buffers of its own. In
some workloads, these routines may be called frequently, to create many
abd_t's representing small pieces of a single large abd_t. In
particular, the upcoming RAIDZ Expansion project makes heavy use of
these routines.
This commit adds the ability for the caller to allocate and provide the
abd_t struct to a variant of `abd_get_offset_*()`. This eliminates the
cost of allocating the abd_t and performing the accounting associated
with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for
the `rc_abd`, which references the zio's abd. The upcoming RAIDZ
Expansion project will leverage this infrastructure to increase
performance of reads post-expansion by around 50%.
Additionally, some of the interfaces around creating and destroying
abd_t's are cleaned up. Most significantly, the distinction between
`abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
now disposed of with `abd_free()`.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #8853Closes#11439
Each zfs ioctl that changes on-disk state (e.g. set property, create
snapshot, destroy filesystem) is recorded in the zpool history, and is
printed by `zpool history -i`.
For performance diagnostic purposes, it would be useful to know how long
each of these ioctls took to run. This commit adds that functionality,
with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist.
Additionally, the time recorded in this history log is currently the
time that the history record is written to disk. But in many cases (CLI
args logging and ioctl logging), this happens asynchronously,
potentially many seconds after the operation completed. This commit
changes the timestamp to reflect when the history event was created,
rather than when it was written to disk.
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11440
Build error on illumos with gcc 10 did reveal:
In function 'dmu_objset_refresh_ownership':
../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
[-Werror=enum-conversion]
857 | dsl_dataset_disown(ds, decrypt, tag);
| ^~~~~~~
cc1: all warnings being treated as errors
libzfs_input_check.c: In function 'zfs_ioc_input_tests':
libzfs_input_check.c:754:28: error: implicit conversion from
'enum dmu_objset_type' to 'enum lzc_dataset_type'
[-Werror=enum-conversion]
754 | err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
| ^~~~~~~~~~~
cc1: all warnings being treated as errors
The same issue is present in openzfs, and also the same issue about
ds_hold_flags_t, which currently defines exactly one valid value.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#11406
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab. This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be very
impactful. This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.
To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations. These will now only examine the 100
best metaslabs (as determined by their ms_weight). If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.
To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation. Non-try_hard allocations will only examine the
100 best metaslabs of each vdev. In summary, we will first try normal
allocation. If that fails then we will do a gang allocation. If that
fails then we will do a "try hard" gang allocation. If that fails then
we will have a multi-layer gang block.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11327
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks. Once
multiple allocators were introduced to separate allocation of different
objects it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) same time,
competing for its resources. This change makes each allocator choose
metaslab group independently, colliding with others only sporadically.
Test including simultaneous write into 4 files with recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores show
reduction of vdev queue lock contention from 54 to 27% due to better
load distribution. Unfortunately it won't help much ZVOLs yet since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.
While there, to reduce the number of pointer dereferences change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#11288
Building the spa module for i386 caused gcc to emit
-Wint-to-pointer-cast "cast to pointer from integer of different size"
because spa.spa_did was uint64_t but pthread_join (via thread_join in
spa_deactivate) takes a pointer (32-bit on i386). Define spa_did to be
pointer-size instead. For now spa_did is in fact never non-zero and the
thread_join could instead be ifdef'd out, but changing the size of
spa_did may be more useful for the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Closes#11336
The last change caused the read completion callback to not be called
if the IO was still in progress. This change restores allocation
of the arc buf callback, but in the callback path checks the new
acb_nobuf field to know to skip buffer allocation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11324
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11105
ZFS currently doesn't react to hotplugging cpu or memory into the
system in any way. This patch changes that by adding logic to the ARC
that allows the system to take advantage of new memory that is added
for caching purposes. It also adds logic to the taskq infrastructure
to support dynamically expanding the number of threads allocated to a
taskq.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#11212
Add ARC_FLAG_NO_BUF to indicate that a buffer need not be
instantiated. This fixes a ~20% performance regression on
cached reads due to zfetch changes.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11220Closes#11232
Investigating influence of scrub (especially sequential) on random read
latency I've noticed that on some HDDs single 4KB read may take up to 4
seconds! Deeper investigation shown that many HDDs heavily prioritize
sequential reads even when those are submitted with queue depth of 1.
This patch addresses the latency from two sides:
- by using _min_active queue depths for non-interactive requests while
the interactive request(s) are active and few requests after;
- by throttling it further if no interactive requests has completed
while configured amount of non-interactive did.
While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities. It should reduce starvation if several non-interactive
processes are running same time with some interactive and I think should
make possible setting of zfs_vdev_max_active to as low as 1.
I've benchmarked this change with 4KB random reads from ZVOL with 16KB
block size on newly written non-fragmented pool. On fragmented pool I
also saw improvements, but not so dramatic. Below are log2 histograms
of the random read latency in milliseconds for different devices:
4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0
, that means maximum latency reduction from 2s to 500ms.
4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0
, i.e. from 4s to 250ms.
1 SAS HDD SEAGATE ST14000NM0048 before:
0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0
, i.e. from 4s to 125ms.
1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90
I've also measured scrub time during the test and on idle pools. On
idle fragmented pool I've measured scrub getting few percent faster
due to use of QD3 instead of QD2 before. On idle non-fragmented pool
I've measured no difference. On busy non-fragmented pool I've measured
scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#11166
- Don't leave fstrans set when passed a snapshot
- Don't remove minor if volmode already matches new value
- (FreeBSD) Wait for GEOM ops to complete before trying
remove (at create time GEOM will be "tasting" in parallel)
- (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL
- (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11199
For encrypted receives, where user accounting is initially disabled on
creation, both 'zfs userspace' and 'zfs groupspace' fails with
EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to
set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a
successful dmu_objset_space_upgrade().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9501Closes#9596
The output of ZFS channel programs is logged on-disk in the zpool
history, and printed by `zpool history -i`. Channel programs can use
10MB of memory by default, and up to 100MB by using the `zfs program -m`
flag. Therefore their output can be up to some fraction of 100MB.
In addition to being somewhat wasteful of the limited space reserved for
the pool history (which for large pools is 1GB), in extreme cases this
can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in
`dmu_buf_hold_array_by_dnode()`.
This commit limits the output size that will be logged to 1MB. Larger
outputs will not be logged, instead a entry will be logged indicating
the size of the omitted output.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11194
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10102
The field is yet another leftover from unsupported zfs_znode_move.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11186
The oid comes from the znode we are already passing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11176
In Linux 5.10 the linux/frame.h header was renamed linux/objtool.h.
Add a configure check to detect and use the correctly named header.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11085
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async
itself. It's sibling zvol_get_data already is platform-independent.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#10979
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.
There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.
In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11142
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11125
Note that this only tracks sizes as requested by the caller.
Actual allocated space will almost always be bigger (e.g., rounded up to
the next power of 2 or page size). Additionally the allocated buffer may
be holding other areas hostage. Nonetheless, this is a starting point
for tracking memory usage in zstd.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11129
The original xuio zero copy functionality has always been unused
on Linux and FreeBSD. Remove this disabled code to avoid any
confusion and improve readability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11124
Refer to the correct section or alternative for FreeBSD and Linux.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11132
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD. With a little refactoring they can be
moved to the common code which is what is done by this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11078
The acltype property is currently hidden on FreeBSD and does not
reflect the NFSv4 style ZFS ACLs used on the platform. This makes it
difficult to observe that a pool imported from FreeBSD on Linux has a
different type of ACL that is being ignored, and vice versa.
Add an nfsv4 acltype and expose the property on FreeBSD.
Make the default acltype nfsv4 on FreeBSD.
Setting acltype to an unhanded style is treated the same as setting
it to off. The ACLs will not be removed, but they will be ignored.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10520
In FreeBSD, there are three compile environments that are supported:
user land, the kernel and the bootloader / standalone. Adjust the
headers to compile in the standalone environment. Limit kernel-only
items from view when _STANDALONE is defined.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Warner Losh <imp@FreeBSD.org>
Closes#10998
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#11047
Code cleanup, a follow up commit to 4d55ea81.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#11020
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux. Web links have been updated
to refer to https://github.com/openzfs/zfs. The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11007
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.
Add const in a few places where it makes sense.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#10997
In non regular use cases allocated memory might stay persistent in memory
pool. This small patch checks every minute if there are old objects which
can be released from memory pool.
Right now with regular use, the pool is checked for old objects on each
allocation attempt from this pool. so basically polling by its use. Now
consider what happens if someone writes a lot of files and stops use of
the volume or even unmounts it. So the code will no longer check if
objects can be released from the pool. Already allocated objects will
still stay in pool cache. this is no big issue for common use. But
someone discovered this issue while doing tests. personally i know this
behavior and I'm aware of it. Its no big issue. just a enhancement
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes#10938Closes#10969
The m4 objtool configure check can incorrectly fail because of a
missing header in the test. This appears to be the result of a
recent kernel change and was observed on the Fedora 5.8.11-200
kernel.
In file included from /home/fedora/zfs/build/objtool/objtool.c:75:
./arch/x86/include/asm/frame.h💯57: error: 'struct pt_regs'
declared inside parameter list will not be visible outside
of this definition or declaration [-Werror]
The consequence of this is that the "stack_frame_non_standard"
check is never run and HAVE_STACK_FRAME_NON_STANDARD is set
incorrectly which results in a build failure. This change adds
the appropriate header to the "objtool" check so it now behaves
as intended.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10990
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a readonly workload
(and thus no exponential backoff in operations from the "write
throttle") such as L2ARC rebuild it is possible to saturate the drives
with I/O requests. These are then effectively compounded with prefetch
requests.
This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10900
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10890
Use the same reduced buffer size for lauxlib that is used on Linux.
Fixes panic on HEAD in lua gsub test designed to exhaust stack space.
With this we can remove the special case to reserve more stack space
on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10959
== Motivation and Context
The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.
In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.
== Description
This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes#10932
Prefer acltype=off|posix, retaining the old names as aliases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10918
nvlist does allow us to support different data types and systems.
To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.
Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#10774
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10743
When hz > 1000, msec / (1000 / hz) results in division by zero.
I found somewhere in FreeBSD using howmany(msec * hz, 1000) to convert
ms to ticks, avoiding the potential for a zero in the divisor.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#10894
Duplicate io and checksum ereport events can misrepresent that
things are worse than they seem. Ideally the zpool events and the
corresponding vdev stat error counts in a zpool status should be
for unique errors -- not the same error being counted over and over.
This can be demonstrated in a simple example. With a single bad
block in a datafile and just 5 reads of the file we end up with a
degraded vdev, even though there is only one unique error in the pool.
The proposed solution to the above issue, is to eliminate duplicates
when posting events and when updating vdev error stats. We now save
recent error events of interest when posting events so that we can
easily check for duplicates when posting an error.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#10861
If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with
`dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC.
However, there is no way to notice or communicate this failure, so it's
extremely difficult to use this functionality correctly, and in fact
almost all callers use `ZFS_SPACE_CHECK_NONE`.
This commit removes the `zfs_space_check_t` argument from
`dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10855
There are a number of places where cv_?_sig is used simply for
accounting purposes but the surrounding code has no ability to
cope with actually receiving a signal. On FreeBSD it is possible
to send signals to individual kernel threads so this could
enable undesirable behavior.
This patch adds routines on Linux that will do the same idle
accounting as _sig without making the task interruptible. On
FreeBSD cv_*_idle are all aliases for cv_*
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10843
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.
This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).
In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.
Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10839
For Linux, when zfs is compiled as an in kernel static variant
and the in kernel zstd library is compiled statically into the kernel
a symbol collision will occur. This wrapper header renames all
of the relevant zstd functions to avoid this problem.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes#10775
Increase the size of DDT_NAMELEN and MNT_LINE_MAX to appease GCC
snprintf truncation warnings.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris McDonough <chrism@plope.com>
Closes#10712Closes#10766
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.
Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:
1. Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.
2. The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.
Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Macy <mmacy@freebsd.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10619
Removing other_size from arc_stats breaks top in 11.x jails
running on HEAD.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10745
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#6247Closes#9024Closes#10277Closes#10278
In FreeBSD trim has defaulted to on for several
years. In order to minimize POLA violations on
import it's important to maintain this default
when importing vendored openzfs in to FreeBSD
base.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10719
We limit the size of nvlists passed to the kernel so a user cannot make
the kernel do an unreasonably large allocation. On FreeBSD this limit
was 128 kiB, which turns out to be a bit too small when doing some
operations involving a large number of datasets or snapshots, for
example replication.
Make this limit tunable, with a platform-specific auto default.
Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the
system limit on user wired memory, which allows it to scale depending
on system configuration.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Issue #6572Closes#10706
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10701
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.
KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM). The flag, module parameter, /proc entries, and associated
code are removed.
KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc(). The flag, /proc entries, and associated code are removed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10673
Up until now zpool.cache has always lived in /boot on FreeBSD.
For the sake of compatibility fallback to /boot if zpool.cache
isn't found in /etc/zfs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10720
This was previously moved because nothing else in-tree uses it, but
evidently DilOS uses it out of tree.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@freebsd.org>
Closes#10361Closes#10685
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist
With a compression= property outside of the understood range:
Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.
After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10631
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10600
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10620
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#10630
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10621
This is a step toward being able to vendor the OpenZFS code in FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10625
* libspl: umem: These are obviously and intentionally unused; annotate
them as such to appease -Wunused-parameter builds that include this
header.
* sys/dmu.h: In this case, clear_on_evict_dbufp is only used for
ZFS_DEBUG builds, so annotate it as __maybe_unused to appease
-Wunused-parameter.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#10606
Drop unnecessary redefinition's of several arcstat values.
Put missing extern declaration of arc_no_grow_shift in arc_impl.h.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10609
The process of evicting data from the ARC is referred to as
`arc_adjust`.
This commit changes the term to `arc_evict`, which is more specific.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10592
These tunables were renamed from vfs.zfs.arc_min and
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10579
Livelists and spacemaps are data structures that are logs of allocations
and frees. Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.
These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).
ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab. However, it
generally halts processing when the error is detected.
When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.
The specific checks added are:
Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]
Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
spacemap log, then report any double FREEs and double ALLOCs
Verify that livelists are consistenet with spacemaps. The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes#10515
== Motivation and Context
The current implementation of 'sharenfs' and 'sharesmb' relies on
the use of the sharetab file. The use of this file is os-specific
and not required by linux or freebsd. Currently the code must
maintain updates to this file which adds complexity and presents
a significant performance impact when sharing many datasets. In
addition, concurrently running 'zfs sharenfs' command results in
missing entries in the sharetab file leading to unexpected failures.
== Description
This change removes the sharetab logic from the linux and freebsd
implementation of 'sharenfs' and 'sharesmb'. It still preserves an
os-specific library which contains the logic required for sharing
NFS or SMB. The following entry points exist in the vastly simplified
libshare library:
- sa_enable_share -- shares a dataset but may not commit the change
- sa_disable_share -- unshares a dataset but may not commit the change
- sa_is_shared -- determine if a dataset is shared
- sa_commit_share -- notify NFS/SMB subsystem to commit the shares
- sa_validate_shareopts -- determine if sharing options are valid
The sa_commit_share entry point is provided as a performance enhancement
and is not required. The sa_enable_share/sa_disable_share may commit
the share as part of the implementation. Libshare provides a framework
for both NFS and SMB but some operating systems may not fully support
these protocols or all features of the protocol.
NFS Operation:
For linux, libshare updates /etc/exports.d/zfs.exports to add
and remove shares and then commits the changes by invoking
'exportfs -r'. This file, is automatically read by the kernel NFS
implementation which makes for better integration with the NFS systemd
service. For FreeBSD, libshare updates /etc/zfs/exports to add and
remove shares and then commits the changes by sending a SIGHUP to
mountd.
SMB Operation:
For linux, libshare adds and removes files in /var/lib/samba/usershares
by calling the 'net' command directly. There is no need to commit the
changes. FreeBSD does not support SMB.
== Performance Results
To test sharing performance we created a pool with an increasing number
of datasets and invoked various zfs actions that would enable and
disable sharing. The performance testing was limited to NFS sharing.
The following tests were performed on an 8 vCPU system with 128GB and
a pool comprised of 4 50GB SSDs:
Scale testing:
- Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
- Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &
Functional testing:
- share each filesystem serially -- zfs share -a
- unshare each filesystem serially -- zfs unshare -a
- reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>
For 'zfs sharenfs=on' scale testing we saw an average reduction in time
of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time
of 83.36%.
Functional testing also shows a huge improvement:
- zfs share -- 97.97% reduction in time
- zfs unshare -- 96.47% reduction in time
- zfs inhert -r sharenfs -- 99.01% reduction in time
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Bryant G. Ly <bryangly@gmail.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-68690
Closes#1603Closes#7692Closes#7943Closes#10300
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit." Two types of users are allowed to change
the limit:
1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`. This works properly.
2. A user with elevated system privileges (e.g. root). This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.
The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process. This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.
This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged. Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.
This problem only impacts Linux. Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8226Closes#10545
zfs_rangelock_tryenter() bails immediately instead of waiting for the
lock to become available. This will be used to resolve a deadlock in
the FreeBSD page-in code. No functional change intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#10519
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices. However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10349
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems. In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems. For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.
Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.
These changes do not impact system behavior, they are purely code
cleanup. The changes are:
Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.
In `arc_available_memory()`:
* `desfree` is unused, remove it
* rename `freemem` to avoid conflict with pre-existing `#define`
* remove checks related to arenas
* use units of bytes, rather than converting from bytes to pages and
then back to bytes
`SPL_KMEM_CACHE_REAP` is unused, remove it.
`skc_reap` is unused, remove it.
The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.
`vmem_size()` and associated type and macros are unused, remove them.
In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10499
The kernel headers are installed for DKMS on linux, so don't install
them unless we're building on linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10506
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.
Add some files that were missing altogether in various Makefile's.
The changes to disted files in this commit (excluding deleted files):
+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10501
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`. Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`. However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.
Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`. Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC. During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).
To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.
Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.
Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.
To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes#10496
There's no need to specify the srcdir explicitly in _HEADERS and
EXTRA_DIST.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10493
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.
Add the handler so the tunable can be declared in common code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10490
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.
Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10471
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10474
When compiling with Clang++ it does not allow for untagged structs, so
struct ddr_begin needs to be declared before the struct that uses it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10453
Linux defines different vdev_disk_t members to macOS, but they are
only used in vdev_disk.c so move the declaration there.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10452
gcc10.1 complains with:
../../include/sys/dmu.h:373:24: error: ‘%s’ directive output may be
truncated writing up to 95 bytes into a region of size 75
[-Werror=format-truncation=]
373 | #define DMU_POOL_DDT "DDT-%s-%s-%s"
| ^~~~~~~~~~~~~~
../../module/zfs/ddt.c:256:37: note: in expansion of macro
‘DMU_POOL_DDT’
256 | (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
| ^~~~~~~~~~~~
../../include/sys/dmu.h:373:32: note: format string is defined here
373 | #define DMU_POOL_DDT "DDT-%s-%s-%s"
| ^~
../../module/zfs/ddt.c:256:9: note: ‘snprintf’ output 7 or more bytes
(assuming 102) into a destination of size 80
256 | (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
257 | zio_checksum_table[ddt->ddt_checksum].ci_name,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
258 | ddt_ops[type]->ddt_op_name, ddt_class_name[class]);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Increasing DTT_NAMELEN fixes it.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10433
For MIPS architectures on Linux the ZERO_PAGE macro references
empty_zero_page, which is exported as a GPL symbol. The call to
ZERO_PAGE in abd_alloc_zero_scatter has been removed and a single
zero'd page is now allocated for each of the pages in abd_zero_scatter
in the kernel ABD code path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10428
Correct various typos in the comments and tests.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#10423
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#6224Closes#10383
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.
We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.
We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).
We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#9713Closes#9789Closes#10224
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.
Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#10417
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used. If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().
The biggest change is adding a size parameter to zfs_id_to_fuidstr().
The various *_impl_get() functions are only used on linux and have
not yet been updated.
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10400
C++ is a little picky about not using keywords for names, or string
constness.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10409
Expand the FreeBSD spl for kstats to support all current types
Move the dataset_kstats_t back to zvol_state_t from zfs_state_os_t
now that it is common once again
```
kstat.zfs/mypool.dataset.objset-0x10b.nunlinked: 0
kstat.zfs/mypool.dataset.objset-0x10b.nunlinks: 0
kstat.zfs/mypool.dataset.objset-0x10b.nread: 150528
kstat.zfs/mypool.dataset.objset-0x10b.reads: 48
kstat.zfs/mypool.dataset.objset-0x10b.nwritten: 134217728
kstat.zfs/mypool.dataset.objset-0x10b.writes: 1024
kstat.zfs/mypool.dataset.objset-0x10b.dataset_name: mypool/datasetname
```
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10386
The macOS kmem implementation uses avl_update() and related
functions. These same function exist in the Solaris AVL code but
were removed because they were unused. Restore them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10390
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.
This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10069
We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct.
Move the definition to the file where it is needed, and define it as
/dev/zvol/.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10337
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections. On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving. By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.
Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10332
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10319
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.
The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10293
Functional changes:
We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.
1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
reflect the total aligned size of log blocks on the device. This is
also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
which reflect the total number of L2ARC log blocks present on cache
devices. It is also reflected in the header of the cache device as
"dh_lb_count".
In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.
Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.
If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().
If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.
In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.
Non-functional changes:
Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.
Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.
Fix inaccurate comments for example in l2arc_log_blk_restore().
Add asserts at the end in l2arc_evict() and l2arc_write_buffers().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#10228
Modern bootloaders leverage data stored in the root filesystem to
enable some of their powerful features. GRUB specifically has a grubenv
file which can store large amounts of configuration data that can be
read and written at boot time and during normal operation. This allows
sysadmins to configure useful features like automated failover after
failed boot attempts. Unfortunately, due to the Copy-on-Write nature
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store
data that allows the bootloader to make changes to the data.
This work is very similar to work that was done on Illumos to enable
similar functionality in the FreeBSD bootloader. This patch is different
in that the data being stored is a raw grubenv file; this file can store
arbitrary variables and values, and the scripting provided by grub is
powerful enough that special structures are not required to implement
advanced behavior.
We repurpose the second padding area in each label to store the grubenv
file, protected by an embedded checksum. We add two ioctls to get and
set this data, and libzfs_core and libzfs functions to access them more
easily. There are no direct command line interfaces to these functions;
these will be added directly to the bootloader utilities.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10009
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated. Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.
This commit removes the code for sending and receiving deduplicated send
streams. `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream. `zfs receive` of
a deduplicated send stream will print an error message and fail.
The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.
Several new tests are added which leverage `zstream redup`.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156Closes#10212
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators". Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators. In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.
This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t. The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10213
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#925Closes#1823Closes#2672Closes#3744Closes#9582
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.
Add warning messages when tunables are being ignored.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10157Closes#10158
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10178
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.
To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.
With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10179
Add a mechanism to wait for delete queue to drain.
When doing redacted send/recv, many workflows involve deleting files
that contain sensitive data. Because of the way zfs handles file
deletions, snapshots taken quickly after a rm operation can sometimes
still contain the file in question, especially if the file is very
large. This can result in issues for redacted send/recv users who
expect the deleted files to be redacted in the send streams, and not
appear in their clones.
This change duplicates much of the zpool wait related logic into a
zfs wait command, which can be used to wait until the internal
deleteq has been drained. Additional wait activities may be added
in the future.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9707
Linux changed the default max ARC size to 1/2 of physical memory to
deal with shortcomings of the Linux SLUB allocator. Other platforms
do not require the same logic.
Implement an arc_default_max() function to determine a default max ARC
size in platform code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10155
Make the cityhash code compile into libzfs, in preparation for the new
"zstream" command.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10152
This change adds a separate return code to zfs_ioc_recv that is used
for incomplete streams, in addition to the existing return code for
streams that contain corruption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#10122
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9662Closes#10115
dnode_special_close() waits for the refcount of dn_holds to go to zero
without holding the dn_mtx. dnode_rele_and_unlock() does the final
remove to dn_holds with dn_mtx being held:
refs = zfs_refcount_remove(&dn->dn_holds, tag);
mutex_exit(&dn->dn_mtx);
So, there is a race condition after the remove until dn_mtx is
dropped. During that time, dnode_destroy() can get called, which ends
up in dnode_dest() calling mutex_destroy() and a panic since the lock
is still held.
This change adds a condvar to wait for the final dnode_rele_and_unlock()
to release the dn_mtx before calling dnode_destroy().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes#7814Closes#10101
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10067
Manual trims fall into the category of long-running pool activities
which people might want to wait synchronously for. This change adds
support to 'zpool wait' for waiting for manual trim operations to
complete. It also adds a '-w' flag to 'zpool trim' which can be used to
turn 'zpool trim' into a synchronous operation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#10071
__zio_execute() calls zio_taskq_member() to determine if we are running
in a zio interrupt taskq, in which case we may need to switch to
processing this zio in a zio issue taskq. The call to
zio_taskq_member() can become a performance bottleneck when we are
processing a high rate of zio's.
zio_taskq_member() calls taskq_member() on each of the zio interrupt
taskqs, of which there are 21. This is slow because each call to
taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively
slow.
This commit improves the performance of zio_taskq_member() by having it
cache the value of tsd_get(taskq_tsd), reducing the number of those
calls to 1/21th of the current behavior.
In a test case running `zfs send -c >/dev/null` of a filesystem with
small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of
one CPU, and with this change it is reduced to 1.3%. Overall time to
perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000
blocks/sec).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10070
This function should only return "linux" on Linux.
Move the kernel part of the function out of common code.
Fix the tests for FreeBSD.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10079
By adding a zfs_file_private accessor to the common
interfaces and some extensions to FreeBSD platform
code it is now possible to share the implementations
for the aforementioned functions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10073
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired. Callers are now required to use the time64_t type.
Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code. They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.
Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.
Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10052Closes#10064
* Add dedicated donde_set_dirtyctx routine.
* Add empty dirty record on destroy assertion.
* Make much more extensive use of the SET_ERROR macro.
Reviewed-by: Will Andrews <wca@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9924
The `convoff` function is called only in one code path in `zfs_space`.
Each caller of `zfs_space` is called with a `flock64_t` that has
`l_whence` set to `SEEK_SET`. This means that `convoff` always results
in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and
`int whence` is `SEEK_SET` as well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Closes#10006
Moving forward, we wish to use org.openzfs (no dash) rather than
org.open-zfs or org.zfsonlinux for feature GUIDs and property names.
The existing feature GUIDs cannot be changed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#10003
This adds support for setting user properties in a
zfs channel program by adding 'zfs.sync.set_prop'
and 'zfs.check.set_prop' to the ZFS LUA API.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Contributions-by: Jason King <jason.king@joyent.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes#9950
The module parameter zfs_async_block_max_blocks limits the number of
blocks that can be freed by the background freeing of filesystems and
snapshots (from "zfs destroy"), in one TXG. This is useful when freeing
dedup blocks, becuase each zio_free() of a dedup block can require an
i/o to read the relevant part of the dedup table (DDT), and will also
dirty that block.
zfs_async_block_max_blocks is set to 100,000 by default. For the more
typical case where dedup is not used, this can have a negative
performance impact on the rate of background freeing (from "zfs
destroy"). For example, with recordsize=8k, and TXG's syncing once
every 5 seconds, we can free only 160MB of data per second, which may be
much less than the rate we can write data.
This change increases zfs_async_block_max_blocks to be unlimited by
default. To address the dedup freeing issue, a new tunable is
introduced, zfs_max_async_dedup_frees, which limits the number of
zio_free()'s of dedup blocks done by background destroys, per txg. The
default is 100,000 free's (same as the old zfs_async_block_max_blocks
default).
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10000
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.
According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9949
Add support for bookmark creation and cloning.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9571
This feature allows copying existing bookmarks using
zfs bookmark fs#target fs#newbookmark
There are some niche use cases for such functionality,
e.g. when using bookmarks as markers for replication progress.
Copying redaction bookmarks produces a normal bookmark that
cannot be used for redacted send (we are not duplicating
the redaction object).
ZCP support for bookmarking (both creation and copying) will be
implemented in a separate patch based on this work.
Overview:
- Terminology:
- source = existing snapshot or bookmark
- new/bmark = new bookmark
- Implement bookmark copying in `dsl_bookmark.c`
- create new bookmark node
- copy source's `zbn_phys` to new's `zbn_phys`
- zero-out redaction object id in copy
- Extend existing bookmark ioctl nvlist schema to accept
bookmarks as sources
- => `dsl_bookmark_create_nvl_validate` is authoritative
- use `dsl_dataset_is_before` check for both snapshot
and bookmark sources
- Adjust CLI
- refactor shortname expansion logic in `zfs_do_bookmark`
- Update man pages
- warn about redaction bookmark handling
- Add test cases
- CLI
- pyyzfs libzfs_core bindings
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#9571
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display. Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9640Closes#9729
We need to do the same thing to update all spas on any OS for these
tunables, so let's share the code.
While here let's match the types of the literals initializing the
variables with the type of the variable.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9964
Currently SIMD accelerated AES-GCM performance is limited by two
factors:
a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.
b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.
To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.
Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#9749
Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to
re-borrow after the flush. But since asc_borrowed and asc_delta are
accessed only while holding asc_lock, it makes no any sense to modify
as_lower_bound and as_upper_bound in multiple steps. Instead of that
the new code uses only 2 atomics in all the cases, one per as_*_bound
variable. I think even that is overkill, simple atomic store and
load could be used here, since all modifications are done under the
as_lock, but there are no such primitives in ZFS code now.
While there, make borrow code consider previous borrow value, so that
on mixed request patterns reduce chance of needing to borrow again if
much larger request follows tiny one that needed borrow.
Also reduce as_numbuckets from uint64_t to u_int. It makes no sense
to use so large division operation on every aggsum_add().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9930
Move db_link into the same cache line as db_blkid and db_level.
It allows significantly reduce avl_add() time in dbuf_create() on
systems with large RAM and huge number of dbufs per dnode.
Avoid few accesses to dbuf_caches[].size, which is highly congested
under high IOPS and never stays in cache for a long time. Use local
value we are receiving from zfs_refcount_add_many() any way.
Remove cache_size_bytes_max bump from dbuf_evict_one(). I don't see
a point to do it on dbuf eviction after we done it on insertion in
dbuf_rele_and_unlock().
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#9931
Additionally pull in state machine comments about
upcoming async cow work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9902
This replaces the placeholder ZFS_PROP_PRIVATE with ZFS_PROP_ACLMODE,
matching what is done in the NFSv4 ACLs PR (#9709).
On FreeBSD we hide ZFS_PROP_ACLTYPE, while on Linux we hide
ZFS_PROP_ACLMODE.
The tests already assume this arrangement.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#9913
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes#7863Closes#9885
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.
Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9539
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes#9738
Currently, the handling for errata #4 has two issues which allow
the checks for this issue to be bypassed using resumable sends.
The first issue is that drc->drc_fromsnapobj is not set in the
resuming code as it is in the non-resuming code. This causes
dsl_crypto_recv_key_check() to skip its checks for the
from_ivset_guid. The second issue is that resumable sends do not
clean up their on-disk state if they fail the checks in
dmu_recv_stream() that happen before any data is received.
As a result of these two bugs, a user can attempt a resumable send
of a dataset without a from_ivset_guid. This will fail the initial
dmu_recv_stream() checks, leaving a valid resume state. The send
can then be resumed, which skips those checks, allowing the receive
to be completed.
This commit fixes these issues by setting drc->drc_fromsnapobj in
the resuming receive path and by ensuring that resumablereceives
are properly cleaned up if they fail the initial dmu_recv_stream()
checks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9818Closes#9829
Update the project website links contained in to repository to
reference the secure https://zfsonlinux.org address.
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9837
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9007
The quota functions are common to all implementations and can be
moved to common code. As a simplification they were moved to the
Linux platform code in the initial refactoring.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9710
Change many of the znops routines to take a znode rather
than an inode so that zfs_replay code can be largely shared
and in the future the much of the znops code may be shared.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9708
The zfsvfs->z_sb field is Linux specified and should be abstracted.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9697
This change allows us to align the code dump logic across platforms.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9691
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9664
FreeBSD requires three additional ioctls, they are ZFS_IOC_NEXTBOOT,
ZFS_IOC_JAIL, and ZFS_IOC_UNJAIL. These have been added after the
Linux-specific ioctls. The range 0x80-0xFF has been reserved for
future optional platform-specific ioctls. Any platform may choose
to implement these as appropriate.
None of the existing ioctl numbers have been changed to maintain
compatibility. For Linux no vectors have been registered for the
new ioctls and they are reported as unsupported.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9667
Update zfs_deadman_failmode to use the ZFS_MODULE_PARAM_CALL
wrapper, and split the common and platform specific portions.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9670
- on Linux move Linux specific headers to zfs_context_os.h
- on FreeBSD move FreeBSD specific definitions to zfs_context_os.h
- remove duplicate tsd_ definitions
- remove unused AT_TYPE
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9668
FreeBSD needs to cope with multiple version of the zfs_cmd_t
structure. Allowing the platform code to pre and post
process the cmd structure makes it possible to work with
legacy tooling.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9624
Linux and FreeBSD use different names for suid / setuid.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9632
FreeBSD needs to be able to pass the jail id to the jail/unjail ioctls
and the struct file in the device structure is unused.
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9625
- move linux/ includes to platform headers
- add void * io_bio to zio for tracking the underlying bio
- add freebsd specific fields to abd_scatter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9615
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840Closes#9155Closes#9378Closes#9551Closes#9588
Provide a common zfs_file_* interface which can be implemented on all
platforms to perform normal file access from either the kernel module
or the libzpool library.
This allows all non-portable vnode_t usage in the common code to be
replaced by the new portable zfs_file_t. The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL. Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9556
This adds a new KMC_KVMEM flag was added to enforce use of the
kvmalloc allocator in kmem_cache_create even for large blocks, which
may also increase performance in some specific cases (e.g. zstd), too.
Default to KVMEM instead of VMEM in spl_kmem_cache_create.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#9034
Increase the minimum supported kernel version from 2.6.32 to 3.10.
This removes support for the following Linux enterprise distributions.
Distribution | Kernel | End of Life
---------------- | ------ | -------------
Ubuntu 12.04 LTS | 3.2 | Apr 28, 2017
SLES 11 | 3.0 | Mar 32, 2019
RHEL / CentOS 6 | 2.6.32 | Nov 30, 2020
The following changes were made as part of removing support.
* Updated `configure` to enforce a minimum kernel version as
specified in the META file (Linux-Minimum: 3.10).
configure: error:
*** Cannot build against kernel version 2.6.32.
*** The minimum supported kernel version is 3.10.
* Removed all `configure` kABI checks and matching C code for
interfaces which solely predate the Linux 3.10 kernel.
* Updated all `configure` kABI checks to fail when an interface is
missing which was in the 3.10 kernel up to the latest 5.1 kernel.
Removed the HAVE_* preprocessor defines for these checks and
updated the code to unconditionally use the verified interface.
* Inverted the detection logic in several kABI checks to match
the new interface as it appears in 3.10 and newer and not the
legacy interface.
* Consolidated the following checks in to individual files. Due
the large number of changes in the checks it made sense to handle
this now. It would be desirable to group other related checks in
the same fashion, but this as left as future work.
- config/kernel-blkdev.m4 - Block device kABI checks
- config/kernel-blk-queue.m4 - Block queue kABI checks
- config/kernel-bio.m4 - Bio interface kABI checks
* Removed the kABI checks for sops->nr_cached_objects() and
sops->free_cached_objects(). These interfaces are currently unused.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9566
Some of the znode fields are different and functions
consuming an inode don't exist on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9536
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#9525
MODULE_VERSION is already defined on FreeBSD. Wrap all of the
used MODULE_* macros for the sake of consistency and portability.
Add a user space noop version to reduce the need for _KERNEL ifdefs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9542
A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
This change is a follow up to 2cc479d0.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9534
The FreeBSD implementation can fail, allow this function to
fail and add the required error handling for Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9541
FreeBSD has its own sha routines that the port uses.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9530
Follow up to 511fce6b which missed a cast.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9533
Move these Linux module parameter get/set helpers in to
platform specific code.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9457
- FreeBSD's rootpool import code uses spa_config_parse
- Move the zvol_create_minors call out from under the
spa_namespace_lock in spa_import. It isn't needed and it causes
a lock order reversal on FreeBSD.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9499
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.
A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#9487Closes#9489
EBADE, EBADR, and ENOANO do not exist on FreeBSD
The libspl errno.h is similarly platform dependent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9498
It's mostly a noop on ZoL and it conflicts with platforms that
support dtrace. Remove this header to resolve the conflict.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9497
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.
This has the additional advantage of allowing us to use the FPU
again in user threads. So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346Closes#9403
Factor Linux specific memory pressure handling out of ARC. Each
platform will have different available interfaces for managing memory
pressure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9472
Only pass the file descriptor to make zfsdev_get_miror() portable.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9466
We get the sizeof the appropriate type, and don't cast away const.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9455
Rename certain functions for more consistency when they share common
features. Make comments clearer about what arguments should be passed
to the insert and add functions.
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9441
FreeBSD uses this in its pager ops routines
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9431
FreeBSD's zvol platform code requires access to the
zil_async_to_sync() function.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9440
In the FreeBSD kernel the strdup signature is:
```
char *strdup(const char *__restrict, struct malloc_type *);
```
It's unfortunate that the developers have chosen to change
the signature of libc functions - but it's what I have to
deal with.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9433
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9181
A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as
per our convention to prevent any conflict with existing symbols.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9402
Make arc_stats visible to platform code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9386
Refactor the zfs ioctls in to platform dependent and independent bits.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9301
Refactor the zvol in to platform dependent and independent bits.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9295
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.
This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:
- Scrubs or resilvers to complete
- Devices to initialized
- Devices to be replaced
- Devices to be removed
- Checkpoints to be discarded
- Background freeing to complete
For example, a scrub that is in progress could be waited for by running
zpool wait -t scrub <pool>
This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.
This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.
Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:
- Added ZoL-style ioctl input declaration.
- Reorganized error handling in zpool_initialize in libzfs to integrate
better with changes made for TRIM support.
- Fixed check for whether a checkpoint discard is in progress.
Previously it also waited if the pool had a checkpoint, instead of
just if a checkpoint was being discarded.
- Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
- Updated more existing tests to make use of new 'zpool wait'
functionality, tests that don't exist in Delphix OS.
- Used existing ZoL tunable zfs_scan_suspend_progress, together with
zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
- Added support for a non-integral interval argument to zpool wait.
Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#9162
1. Fix issue: Kernel BUG with QAT during decompression #9276.
Now it is uninterruptible for a specific given QAT request,
but Ctrl-C interrupt still works in user-space process.
2. Copy the digest result to the buffer only when doing encryption,
and vise-versa for decryption.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Closes#9276Closes#9303
Move Linux specific tracing headers and source to platform directories
and update the build system.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9290
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify(). A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool. This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer. For example, to perform a
`zpool attach` as shown in the abbreviated stack below.
To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property. The cached
value is then relied upon for subsequent accesses.
Call Trace:
spa_config_enter+0x1e8/0x350 [zfs]
zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
zio_read+0x6c/0x140 [zfs]
...
vfs_read+0xfc/0x1e0
kernel_read+0x50/0x90
...
spa_get_hostid+0x1c/0x38 [zfs]
spa_config_generate+0x1a0/0x610 [zfs]
vdev_label_init+0xa0/0xc80 [zfs]
vdev_create+0x98/0xe0 [zfs]
spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9256Closes#9285
When adding the SIMD compatibility code in e5db313 the decryption of a
dataset wrapping key was left in a user thread context. This was done
intentionally since it's a relatively infrequent operation. However,
this also meant that the encryption context templates were initialized
using the generic operations. Therefore, subsequent encryption and
decryption operations would use the generic implementation even when
executed by an I/O pipeline thread.
Resolve the issue by initializing the context templates in an I/O
pipeline thread. And by updating zio_do_crypt_uio() to dispatch any
encryption operations to a pipeline thread when called from the user
context. For example, when performing a read from the ARC.
Tested-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9215Closes#9296
Move platform specific Linux source under module/os/linux/
and update the build system accordingly. Additional code
restructuring will follow to make the common code fully
portable.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Closes#9206
Adds ZFS_MODULE_PARAM to abstract module parameter
setting to operating systems other than Linux.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9230
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.
As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context. No functional change, the source has been
refactored and the relevant #include's updated.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9198
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9238
With the other metaslab changes loaded onto a system, we can
significantly reduce the memory usage of each loaded metaslab and
unload them on demand if there is memory pressure. However, none
of those changes actually result in us keeping more metaslabs loaded.
If we don't keep more metaslabs loaded, we will still have to wait
for demand-loading to finish when no loaded metaslab can satisfy our
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while
unloading is currently done on a txg-count basis. Txgs can take a
widely varying range of times, from tenths of a second to several
seconds. This can result in confusing, hard to predict behavior.
This change simply adds a time-based component to metaslab unloading.
A metaslab will remain loaded for one minute and 8 txgs (by default)
after it was last used, unless it is evicted due to memory pressure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes#9197
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.
We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7151Closes#8910Closes#9123Closes#9145
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.
This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9203
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing. With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.
By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes#8744
On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.
First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs. We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9128
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run
into a race condition where that vdev is temporarily unavailable. This
happens when a device removal operation and the old vdev_t has been
removed from the array, but the new indirect vdev has not yet been
inserted.
We hold the spa_config_lock while doing our sensitive verification.
To ensure that we don't deadlock, we only grab the lock if we don't
have config_writer held. In addition, I had to const the tags of the
refcounts and the spa_config_lock arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9112
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.
We fix this by only calling zil_remove_async when the file is fully
unlinked.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#8769Closes#9061
When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue too many I/Os that it consumes all the memory
of the system resulting in an OOM message or a hang.
To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes#9146
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9092
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9055
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.
This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.
We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes#8416
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex. Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9062
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).
This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#9015Closes#9044
lockdep reports a possible recursive lock in dbuf_destroy.
It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode. However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.
This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.
In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.
This is the lockdep report:
[ 122.950921] ============================================
[ 122.950921] WARNING: possible recursive locking detected
[ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O
[ 122.950921] --------------------------------------------
[ 122.950921] dbu_evict/1457 is trying to acquire lock:
[ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921]
but task is already holding lock:
[ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
other info that might help us debug this:
[ 122.950921] Possible unsafe locking scenario:
[ 122.950921] CPU0
[ 122.950921] ----
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921] lock(&dn->dn_dbufs_mtx);
[ 122.950921]
*** DEADLOCK ***
[ 122.950921] May be due to missing lock nesting notation
[ 122.950921] 1 lock held by dbu_evict/1457:
[ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[ 122.950921]
stack backtrace:
[ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1
[ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009
[ 122.950921] Call Trace:
[ 122.950921] dump_stack+0x91/0xeb
[ 122.950921] __lock_acquire+0x2ca7/0x4f10
[ 122.950921] lock_acquire+0x153/0x330
[ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs]
[ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs]
[ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs]
[ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs]
[ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs]
[ 122.950921] taskq_thread+0x979/0x1480 [spl]
[ 122.950921] kthread+0x2e7/0x3e0
[ 122.950921] ret_from_fork+0x27/0x50
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#8984
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe
Change target size of metaslabs from 256GB to 16GB
c853f382db
zdb -L should skip leak detection altogether
21e7cf5da8
vs_alloc can underflow in L2ARC vdevs
7558997d2f
Simplify log vdev removal code
6c926f426a
Get rid of space_map_update() for ms_synced_length
425d3237ee
Introduce auxiliary metaslab histograms
928e8ad47d
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8442
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.
Saves stack a bit on lookup and unlink.
(#if0'd members instead of removing since comments refer to them.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#9025
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels. This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.
For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used. All non-kernel
threads will restore the preserved user FPU state.
For 5.2 and latter kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers. Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.
In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS. The vast majority of SIMD operations are
already performed by the IO pipeline. The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact. The two most noteworthy cases are:
1) Decrypting the wrapping key for an encrypted dataset,
i.e. `zfs load-key`. All other encryption and decryption
operations will use the SIMD optimized implementations.
2) Generating the payload checksums for a `zfs send` stream.
In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take in to
consideration the calling context. This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).
The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time. This code
was moved in to a taskq in order to accommodate the new kernel
thread restriction.
Finally, a few other modifications were made in order to further
harden this code and facilitate testing. They include updating
each implementations operations structure to be declared as a
constant. And allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8754Closes#8793Closes#8965
Currently, sequential async write workloads spend a lot of time
contending on the dn_struct_rwlock. This lock is responsible for
protecting the entire block tree below it; this naturally results
in some serialization during heavy write workloads. This can be
resolved by having per-dbuf locking, which will allow multiple
writers in the same object at the same time.
We introduce a new rwlock, the db_rwlock. This lock is responsible
for protecting the contents of the dbuf that it is a part of; when
reading a block pointer from a dbuf, you hold the lock as a reader.
When writing data to a dbuf, you hold it as a writer. This allows
multiple threads to write to different parts of a file at the same
time.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens matt@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-52564
External-issue: DLPX-53085
External-issue: DLPX-57384
Closes#8946
ZFS tracing efforts are hampered by the inability to access zfs static
probes(probes using DTRACE_PROBE macros). The probes are available via
tracepoints for GPL modules only. The build could be modified to
generate a function for each unique DTRACE_PROBE invocation. These could
be then accessed via kprobes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>
Closes#8659Closes#8663
* zfs redact error messages do not end with newline character
* 30af21b0 inadvertently removed some ZFS_PROP comments
* man/zfs: zfs redact <redaction_snapshot> is not optional
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8988
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.
This change adds check for empty sublists to avoid this.
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8909
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8951
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs. This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.
Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8944
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.
Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete. Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.
Porting notes: Added missing return value from cv_wait_sig()
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926Closes#8904
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8945
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.
Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8737Closes#8870
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x). The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.
ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful. The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad. This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.
Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.
Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).
In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection. It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).
We should remove the dedupditto functionality. For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect. For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate. However, ZFS won't write any new dedupditto
blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270Closes#8310
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#7958
`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-47477
Closes#8894
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.
On debug bits this assertion will fail in spa_vdev_copy_segment():
ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);
On nondebug, the system panics with a stack like:
metaslab_free_concrete()
metaslab_free_impl()
metaslab_free_impl_cb()
vdev_indirect_remap()
free_from_removing_vdev()
metaslab_free_impl()
metaslab_free_dva()
metaslab_free()
Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default. We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.
This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes#8893
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes#8862
Scatter ABD's are allocated from a number of pages. In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer. This can have a
measurable performance overhead on some workloads.
https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter. For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path. For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).
This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
All single-page (4K) ABD's can be represented this way. Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages). This is often the case for 2-page (8K)
ABD's.
Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.
Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8580
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.
The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.
To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.
External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8865
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure,
on heavily fragmented pools they can still consume a significant
amount of memory. This problem is amplified when we fail to unload
metaslabs that we aren't using. Currently, we only unload a metaslab
during metaslab_sync_done; in order for that function to be called
on a given metaslab in a given txg, we have to have dirtied that
metaslab in that txg. If the dirtying was the result of an allocation,
we wouldn't be unloading it (since it wouldn't be 8 txgs since it
was selected), so in effect we only unload a metaslab during txgs
where it's being freed from.
We move the unload logic from sync_done to a new function, and
call that function on all metaslabs in a given vdev during
vdev_sync_done().
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#8837
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time. The user may
not know why, or be able to tell whether the import is progressing or is
hung.
Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one). The kstat is /proc/spl/kstat/zfs/import_progress.
The kstat contents are as follows:
pool_guid load_state multihost_secs max_txg pool_name
16667015954387398 3 15 0 tank3
load_state: the value of spa_load_state
multihost_secs: seconds until the end of the multihost activity
check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring
This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting. The
zpool import subcommand could also be modified to report this
information.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#8696
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object. It may need to be removed or kept depending on how the
object was modified at the source.
This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties. This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.
What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose. When set then
the object references a spill block and it must be kept. Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record. Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.
Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible. Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed. Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.
The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS. Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.
OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8668
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.
(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)
--
# zpool create p1 <device>
# zfs create p1/f1
# zfs set atime=off p1
# echo test > /p1/f1/test
# sync
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
p1 176K 18.9G 25.5K /p1
p1/f1 26K 18.9G 26K /p1/f1
# zfs get atime
NAME PROPERTY VALUE SOURCE
p1 atime off local
p1/f1 atime off inherited from p1
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:33.741205192 +0900
# cat /p1/f1/test
test
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:50.173231861 +0900
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--
The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.
Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.
Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.
The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8674Closes#8675
This comment seems to misunderstand the ## preprocessor token, which
does token concatenation. It is not needed here, since we are
concatenating string literals, which is performed by putting the
literals next to each other.
Additionally, the comment uses offensive language.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8698Closes#8699
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#8626
Nobody is going to be bumping SPA_VERSION again, as OpenZFS has moved on
to feature flags. Also, there is no requirement to keep GRUB
up-to-date, nor has that been happening.
The ZPL_VERSION could be bumped, but that would likely be handled in a
similar way, by adding filesystem feature flags. In any event, we do
not need this comment, and we certainly don't need a reference to the
GRUB 0.97 source code in a Solaris tree.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#8626
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes#8467Closes#8476
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends. By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.
This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool. The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience. The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.
The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq. This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.
In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property. It relies on the exact same infrastructure as the
manual TRIM. However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab. When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree. The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.
Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`. This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them. An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8419Closes#598
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7842
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.
This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.
To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8453
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.
This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.
This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8308
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix that for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.
For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8168Closes#8487
By default, when multihost is enabled for a pool, the pool is
suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write. This is the recommended
configuration.
The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.
There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This
in turn caused MMP to suspend the pool.
In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O. The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.
Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.
Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor. This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7709Closes#8495
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#8077
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes. This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.
The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.
For example, in a pool like this:
root-vdev
+------+---------------+
vdev1 vdev2
| |
| +------+-----+-----+----+
disk1 disk2 disk3 disk4 disk5 disk6
vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1
is chosen, disk1 will be chosen. However, every time vdev2 is chosen,
disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as
many MMP writes as disk2.
This may create wear issues in the case of SSDs. It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.
The new code maintains a list of leaf vdevs in the pool. MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached. It stops when a
suitable leaf is found or all leaves have been examined.
Added a test to verify MMP write distribution is even.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7953
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#7674Closes#7675Closes#7908
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.
Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes#8433
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.
The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram. The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.
In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8358
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#7513
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#8142
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).
This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.
The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.
As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8328
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.
Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8327
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.
Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8290
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8299
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8291
The following fields from the vdev_t struct are not used anywhere.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8285
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8229
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.
The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.
The zap_create_* functions have been updated to take advantage of
this new functionality. The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().
The new symbols have been exported for use by Lustre.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8015
PROBLEM
========
When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.
SOLUTION
=========
This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.
When stopping or cancelling a vdev initialization perform it
in two phases. First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.
On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second. The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.
This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist. The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s. This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.
Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#8230
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab:
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210ebCloses#8230
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping. However, the code is still confusing and error-prone,
because this thread has two responsibilities. We should instead
separate this into two threads each with their own responsibility:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9Closes#8165
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.
Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).
Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.
The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).
For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.
Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.
This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).
Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).
Porting Notes:
* Reconciled conflicts with the fastwrite feature.
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6Closes#8188
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.
We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#1537Closes#7926
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.
NAME STATE READ WRITE CKSUM SLOW
testpool ONLINE 0 0 0 -
mirror-0 ONLINE 0 0 0 -
loop0 ONLINE 0 0 0 20
loop1 ONLINE 0 0 0 0
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7756Closes#6885
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c). This
removes the need for utilities to link against both libzpool and
libzfs.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#8050
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).
In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.
This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8010
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.
This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7732
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8032
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.
This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.
This functionality will be used by the Redacted Send/Receive feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#7981
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
`mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
thread #2, so it stuck and can't complete, then txg "n" will
not complete.
So thread #1 and thread #2 are deadlocked.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes#7939
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing. While this is the most common cause, it's not the only
cause. Attemping to access a damaged ZAP will result in other
errors.
The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import. Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully. Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7809Closes#7921
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t. However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.
We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.
Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
check to prevent needless warnings when integrating changes
which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
needed to defer the cv_destroy() which was called immediately
after cv_broadcast(). With d2733258 this should be safe but
if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
to this change in order to make it easier to apply OpenZFS 9689.
- d88895a0, which removed the dummy znode from zvol_state
- e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function. Preserved the
rangelock free list. The cv_destroy() function will block waiting
for all processes in cv_wait() to be scheduled and drop their
reference. This is done to ensure it's safe to free the condition
variable. However, blocking while holding the rl->rl_lock mutex
can result in a deadlock on Linux. A free list is introduced to
defer the cv_destroy() and kmem_free() until after the mutex is
released.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes#7980
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#7982
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred. Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;
This issue may occur when using encryption with refcount debugging
enabled. An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder. When unsharing the buffer the
p_abd should be transferred.
This issue does not impact production builds because refcount holders
are not tracked.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7219Closes#8000
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4baCloses#7976
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.
In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7949
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.
To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes#7963
The DD_FIELD_LAST_REMAP_TXG macro was added twice (with the same value).
This change removes one of them.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7968
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):
* We don't account for the fact that seq_file sometimes visits a node
multiple times, which results in missing messages when read through
procfs.
* We don't keep separate state for each reader of a file, so concurrent
readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
the list between read syscalls, so reading from these files in procfs
can cause the system to crash.
This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes#7819
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.
Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.
Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes#7885Closes#7932
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module). When not built as a module add
the sysfs entries under /sys/fs/zfs/.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7868Closes#7872
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#5182
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).
Programmatically document, via a zfs_ioc_key_t, the valid arguments
for the ioc commands that rely on nvpair input arguments (i.e. non
legacy commands from libzfs_core). Automatically verify the expected
pairs before dispatching a command.
This initial phase focuses on the non-legacy ioctls. A follow-on
change can address the legacy ioctl input from the zfs_cmd_t.
The zfs_ioc_key_t for zfs_keys_channel_program looks like:
static const zfs_ioc_key_t zfs_keys_channel_program[] = {
{"program", DATA_TYPE_STRING, 0},
{"arg", DATA_TYPE_UNKNOWN, 0},
{"sync", DATA_TYPE_BOOLEAN_VALUE, ZK_OPTIONAL},
{"instrlimit", DATA_TYPE_UINT64, ZK_OPTIONAL},
{"memlimit", DATA_TYPE_UINT64, ZK_OPTIONAL},
};
Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE,
EBADF, and E2BIG).
ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7780
This extends our sysfs '/sys/module/zfs' entry to include feature
and property attributes. The primary consumer of this information
is user processes, like the zfs CLI, that need to know what the
current loaded ZFS module supports. The libzfs binary will consult
this information when instantiating the zfs and zpool property
tables and the pool features table.
This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
features.runtime
features.pool
properties.dataset
properties.pool
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7706
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.
While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7753Closes#7795
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7795
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#7705
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7650
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:
* Receiving an unencrypted stream as encrypted without
needing to create a "dummy" encrypted parent so that
encryption can be inheritted.
* Allowing users to change their keylocation on receive,
so long as the receiving dataset is an encryption root.
* Allowing users to explicitly exclude or override the
encryption property from an unencrypted properties stream,
allowing it to be received as encrypted.
* Receiving a recursive heirarchy of unencrypted datasets,
encrypting the top-level one and forcing all children to
inherit the encryption.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7650
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).
On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching. We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().
This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes#7736
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#5866Closes#7761
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
that control the crypto implementation. At the moment there is a
choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
with as key schedules generated with various implementations
are not necessarily interchangable.
Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes#7102Closes#7103
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.
This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#7758
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3Closes#7682
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
were intentially not added as tunable module parameters.
OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3efCloses#7749
= Motivation
While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.
An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists. Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg
Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.
= What this patch does
In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).
= Performance Analysis
To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.
== Results (before patch):
=== Sampling Flamegraphs
[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg
=== Wall clock times (in seconds)
```
[Trial 4]
real 5.3
user 0.4
sys 2.3
[Trial 5]
real 8.2
user 0.4
sys 2.4
[Trial 6]
real 6.0
user 0.5
sys 2.3
```
== Results (after patch):
=== Sampling Flamegraphs
[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg
=== Wall clock times (in seconds)
```
[Trial 4]
real 4.9
user 0.0
sys 0.9
[Trial 5]
real 3.8
user 0.0
sys 0.9
[Trial 6]
real 3.6
user 0.0
sys 0.9
```
== Analysis
The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.
From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts. Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().
From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).
Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
zpool_do_events_nvprint().
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1Closes#7748
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.
Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
and dmu_object_alloc_dnsize() functions to use a common shared
dmu_object_alloc_impl() function.
OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408bCloses#7712
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes#7718
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure. Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.
Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev. The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid. If
a match is found the ZED reopens the pool vdev. This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read. Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
Finally, if the property autoexpand=on a vdev expansion will be
attempted. After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated. The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.
In order to make all of the above possible the following changes
were required:
* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
These tests now create a pool which is layered on a loopback,
scsi_debug, and file vdev. This allows for testing of non-
partitioned block device (loopback), a partition block device
(scsi_debug), and a file which does not receive udev change
events. This provided for better test coverage, and by removing
the layering on ZFS volumes there issues surrounding layering
one pool on another are avoided.
* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
This allows for matching by guid rather than path which is a
more reliable way for the ZED to reference a vdev.
* Fixed zfs_zevent_wait() signal handling which could result
in the ZED spinning when a signal was not handled.
* Removed vdev_disk_rrpart() functionality which can be abandoned
in favor of kernel provided blkdev_reread_part() function.
* Added a rwlock which is held as a writer while a disk is being
reopened. This is important to prevent errors from occurring
for any configuration related IOs which bypass the SCL_ZIO lock.
The zpool_reopen_007_pos.ksh test case was added to verify IO
error are never observed when reopening. This is not expected
to impact IO performance.
Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.
* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
ZFS volumes. This is as good as a unique physical path, while the
volumes are not used in the test cases anymore for other reasons
this improvement was included.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#120Closes#2437Closes#5771Closes#7366Closes#7582Closes#7629
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:
The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf. Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.
Performance Testing:
- Disabled kmem_flags
- Tuned dbuf_cache_max_bytes very low (128K)
- Tuned zfs_arc_max very low (64M)
Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.
Results:
| Test | Time (trunk / diff) | I/Os (trunk / diff) |
+------------------------+---------------------+---------------------+
| zpool import | 0:05 / 0:06 | 12.9k / 12.9k |
| zfs get all (uncached) | 1:36 / 0:53 | 16.7k / 5.7k |
| zfs get all (cached) | 1:36 / 0:51 | 16.0k / 6.0k |
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52fCloses#7668
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock. This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.
-- THREAD A --
spa_sync()
spa_sync_upgrades()
rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B
-- THREAD B --
mount_fs()
zpl_mount()
zpl_mount_impl()
dmu_objset_hold()
dmu_objset_hold_flags()
dsl_pool_hold()
dsl_pool_config_enter()
rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
sget()
sget_userns()
grab_super()
down_write(&s->s_umount); <- Waiting on C
-- THREAD C --
cleanup_mnt()
deactivate_super()
down_write(&s->s_umount);
deactivate_locked_super()
zpl_kill_sb()
kill_anon_super()
generic_shutdown_super()
sync_filesystem()
zpl_sync_fs()
zfs_sync()
zil_commit()
txg_wait_synced() <- Waiting on A
Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7598Closes#7659Closes#7691Closes#7693
Motivation
==========
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238Closes#7665
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
* New module parameters:
zfs_spa_discard_memory_limit,
zfs_remove_max_bytes_pause (not documented - debugging only)
vdev_max_ms_count (formerly metaslabs_per_vdev)
vdev_min_ms_count
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8Closes#7570
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe. As part of this change the current_time() function was
also updated to return the timespec64 type.
Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode. It should
be used when working with inode timestamps to ensure matching types.
The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t. Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.
Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other. They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7643
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7637
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes#7618
Reserve bit 25 for the ZSTD compression feature from FreeBSD.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes#7626
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:
1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
TSD values (by setting to NULL vs non-NULL) is expensive, and
we do it very often.
3. According to Nexenta, the current semantics can cause a
deadlock when concurrently calling dmu_objset_evict_dbufs()
(which is rare today, but they are working on a "parallel
unmount" change that triggers this more easily):
Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes#7602
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7564Closes#7584Closes#7592
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752Closes#7462
1. Add a proc entry to display the pool's state:
$ cat /proc/spl/kstat/zfs/tank/state
ONLINE
This is done without using the spa config locks, so it will
never hang.
2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7331Closes#7563
The only remaining consumer of the rwlock compatibility wrappers
is ztest. Remove the wrappers and convert the few remaining
calls to the underlying pthread functions.
rwlock_init() -> pthread_rwlock_init()
rwlock_destroy() -> pthread_rwlock_destroy()
rw_rdlock() -> pthread_rwlock_rdlock()
rw_wrlock() -> pthread_rwlock_wrlock()
rw_unlock() -> pthread_rwlock_unlock()
Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE
which is equivilant to the USYNC_THREAD behavior. There is no
functional change.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7591
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without leaving them accumulate changes
Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:
0ms 311MB in 4114ms (95% p1) 75MB/s 544MB (76%) 336us 153ms 0ms
0ms 8MB in 51ms ( 0% p1) 163MB/s 474MB (66%) 129us 34ms 0ms
0ms 366MB in 4454ms (93% p1) 82MB/s 572MB (79%) 498us 20ms 0ms
0ms 406MB in 5212ms (95% p1) 77MB/s 591MB (82%) 661us 37ms 0ms
0ms 340MB in 5110ms (94% p1) 66MB/s 622MB (86%) 1048us 41ms 1ms
0ms 3MB in 61ms ( 0% p1) 51MB/s 419MB (58%) 33us 0ms 0ms
0ms 361MB in 3555ms (88% p1) 101MB/s 542MB (75%) 335us 40ms 0ms
0ms 356MB in 4592ms (92% p1) 77MB/s 561MB (78%) 430us 89ms 1ms
0ms 11MB in 129ms (13% p1) 90MB/s 507MB (70%) 222us 15ms 0ms
0ms 281MB in 2520ms (89% p1) 111MB/s 542MB (75%) 334us 42ms 0ms
0ms 383MB in 3666ms (91% p1) 104MB/s 557MB (77%) 411us 133ms 0ms
0ms 404MB in 5757ms (94% p1) 70MB/s 635MB (88%) 1274us 123ms 2ms
4ms 367MB in 4172ms (89% p1) 88MB/s 556MB (77%) 401us 51ms 0ms
0ms 42MB in 470ms (44% p1) 90MB/s 557MB (77%) 412us 43ms 0ms
0ms 261MB in 2273ms (88% p1) 114MB/s 556MB (77%) 407us 27ms 0ms
0ms 394MB in 3646ms (85% p1) 108MB/s 552MB (77%) 393us 304ms 0ms
0ms 275MB in 2416ms (89% p1) 113MB/s 510MB (71%) 200us 53ms 0ms
0ms 9MB in 53ms ( 0% p1) 169MB/s 483MB (67%) 140us 100ms 1ms
The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:
if (tx->tx_syncing_txg == 0 &&
tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
tx->tx_quiesced_txg <= tx->tx_synced_txg) {
tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
cv_broadcast(&tx->tx_quiesce_more_cv);
}
The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, ..etc) so the
following scenario can happen:
1] We have an open txg A that had enough dirty data (more than
zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
(and because of A's dirty data) and saw that we are not currently syncing
any txg and no one has requested quiescing so it requests one by bumping
tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
request has been sent to it so it immediately grabs B without letting it
gather enough data, putting it in a quiesced state and opening a new txg C.
In this scenario txg B, is an example of how the entries of interest show up in
the txg.d output.
Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635bCloses#7587
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44Closes#7532
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.
Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes#7546
Issue #7582
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.
Build system and packaging:
* Renamed SPL_* autoconf m4 macros to ZFS_*.
* Removed redundant SPL_* autoconf m4 macros.
* Updated the RPM spec files to remove SPL package dependency.
* The zfs package obsoletes the spl package, and the zfs-kmod
package obsoletes the spl-kmod package.
* The zfs-kmod-devel* packages were updated to add compatibility
symlinks under /usr/src/spl-x.y.z until all dependent packages
can be updated. They will be removed in a future release.
* Updated copy-builtin script for in-kernel builds.
* Updated DKMS package to include the spl.ko.
* Updated stale AUTHORS file to include all contributors.
* Updated stale COPYRIGHT and included the SPL as an exception.
* Renamed README.markdown to README.md
* Renamed OPENSOLARIS.LICENSE to LICENSE.
* Renamed DISCLAIMER to NOTICE.
Required code changes:
* Removed redundant HAVE_SPL macro.
* Removed _BOOT from nvpairs since it doesn't apply for Linux.
* Initial header cleanup (removal of empty headers, refactoring).
* Remove SPL repository clone/build from zimport.sh.
* Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
to build issues when forcing C99 compilation.
* Replaced legacy ACCESS_ONCE with READ_ONCE.
* Include needed headers for `current` and `EXPORT_SYMBOL`.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes#7556
Device removal allocates a new location for each allocated segment on
the disk that's being removed. Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location. When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table. In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.
We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries. For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts). In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use). We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.
This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible. But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.
In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.
Porting notes:
* Add the following as module parameters:
* zfs_condense_indirect_vdevs_enable
* zfs_condense_max_obsolete_bytes
* Document the following module parameters:
* zfs_condense_indirect_vdevs_enable
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes#7536
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123Closes#7459
Problem
=======
When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".
While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support. For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.
Solution
========
Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0Closes#7459
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.
This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7474
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.
Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:
* Requires that callers intending to use this extended interface
set the FMODE_KABI_ITERATE flag on the file structure when
opening the directory.
* Adds the fops.iterate() method to the end of the structure,
without removing fops.readdir().
This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected. Instead fallback to
the fops.readdir() interface which will be available.
Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7460Closes#7463
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5
OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668Closes#7467
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.
The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.
Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.
Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.
Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.
Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7401Closes#7421
In the existing code, when doing a raw (encrypted) zfs receive,
we call arc_convert_to_raw() from open context. This creates a
race condition between arc_release()/arc_change_state() and
writing out the block from syncing context (arc_write_ready/done()).
This change makes it so that when we are doing a raw (encrypted)
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw()
from syncing context when writing out the block of dnodes.
Additionally, we can eliminate dr_raw and associated setters, and
instead know that dnode blocks are always raw when doing a zfs
receive (see the new field os_raw_receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7424Closes#7429
The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in
spa_condense_indirect_thread(), so it calls the
spa_condense_indirect_complete_sync() sync task which sets
the spa_condensing_indirect field to NULL. Waiting for the
sync task to finish, thread A sleeps until the txg is done.
When this happens, thread A will acquire spa_async_lock and
set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is
running spa_sync() checks whether it should condense the
second vdev in vdev_indirect_should_condense() by checking the
spa_condensing_indirect field which was set to NULL by
spa_condense_indirect_thread() from thread A. So it goes on
and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned
assertions fails because thread A has not set spa_condense_thread
to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.
This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.
In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
currently condensing or not
- condensing is more decoupled from the spa_async_thread related
functionality.
As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606eeCloses#6900
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591Closes#6900
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1ebCloses#6900
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.
This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7147Closes#7388
This reverts commit cc63068e95.
Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory. See #7401 for full
details.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401
Cloes #7416
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.
Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7390
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.
In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7372
During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue. This results in hung tasks:
<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3> Tainted: P O 3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260
root@linux:~# cat /proc/spl/taskq
taskq act nthr spwn maxt pri mina
spl_system_taskq/0 1 2 0 64 100 1
active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
wait: 5541
spl_delay_taskq/0 0 1 0 4 100 1
delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1 1 1 0 1 120 1
active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
pend: zvol_task_cb [zfs](0xffff88001fde6800)
This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6330Closes#6890Closes#7343
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
to successfully run in under 5 minutes.
OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06aCloses#6112Closes#7336
zfs_dbgmsg() should record a message by default. As a general
principal, these messages shouldn't be too verbose. Furthermore, the
amount of memory used is limited to 4MB (by default).
dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default
(even on debug builds). These messages are extremely verbose, and
sometimes nontrivial to compute.
SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags. This flag is not set by default (even on debug builds).
This brings our behavior in line with illumos. Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7314
This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the repective flag masks. This patch includes no
functional changes.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7313
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.
Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.
When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.
In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.
In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7296
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.
==
For example:
Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"
Thread A Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
DTRACE_PROBE2() {
__string() {
strlen(zrl->zr_caller) -> 3
allocate buf[4]
}
zrl->zr_owner = B
zrl->zr_caller = "abcd"
__assign_str() {
strcpy(buf, zrl->zr_caller) <- buffer overflow
==
Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.
To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7291
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to. If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.
Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column. The error code for such entries is a
bitfield, displayed in hex:
0x1 At least one vdev (interior or leaf) was not writeable.
0x2 At least one writeable leaf vdev was found, but it had a pending
MMP write.
timestamp = the time in seconds since the epoch when no leaf could be
found originally.
duration = the time (in ns) during which no MMP block was written for
this reason. This does not include the preceeding inter-write period
nor the following inter-write period.
vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.
Sample output, truncated to fit:
For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".
id txg timestamp error duration mmp_delay vdev_guid ...
936 11 1520036441 0 146264 891422313 1740883117838 ...
937 11 1520036441 0 163956 888356657 7320395061548 ...
938 11 1520036442 0 130690 885314969 7320395061548 ...
939 11 1520036442 0 2001068577 882296582 1740883117838 ...
940 11 1520036443 0 161806 882296582 7320395061548 ...
941 11 1520036443 0x2 0 998020546 1 ...
942 11 1520036444 0 136585 998020546 7320395061548 ...
943 11 1520036444 0x2 0 998020257 1 ...
944 11 1520036445 5 2002662964 994160219 1740883117838 ...
945 11 1520036445 0x2 998073118 994160219 3 ...
946 11 1520036447 0 247136 994160219 7320395061548 ...
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7212
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.
Also, convert the checksum and IO delay thresholds to module params for
easy testing.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7252
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7215Closes#7236
The conditional was reversed which caused garbage values to be used when
calculating dds_ref_dsize.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#7234
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the later of
which is moved to dmu_recv_end() so that they key change can be
done atomically.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7200
Currently, the DMU relies on ZIO layer compression to free LO
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.
This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7198
Currently, ZIL claiming dirties objsets which causes
dsl_pool_sync() to attempt to perform user accounting on
them. This causes problems for encrypted datasets that were
raw received before the system went offline since they
cannot perform user accounting until they have their keys
loaded. This triggers an ASSERT in zio_encrypt(). Since
encryption was added, the code now depends on the fact that
data should only be written when objsets are owned. This
patch adds a check in dmu_objset_do_userquota_updates()
to ensure that useraccounting is only done when the objsets
are actually owned for write. As part of this work, the
zfsvfs and zvol code was updated so that it no longer lies
about owning objsets readonly.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6916Closes#7163
CID 173243, 173245: Memory - corruptions (OVERRUN)
Added size argument to lcompat_sprintf() to avoid use of INT_MAX
CID 173244: Integer handling issues (OVERFLOW_BEFORE_WIDEN)
Added cast to uint64_t to avoid a 32 bit overflow warning
CID 173242: Integer handling issues (CONSTANT_EXPRESSION_RESULT)
Conditionally removed unused luai_numisnan() floating point check
CID 173241: Resource leaks (RESOURCE_LEAK)
Added missing close(fd) on error path
CID 173240: (UNINIT)
Fixed uninitialized variable in get_special_prop()
CID 147560: Null pointer dereferences (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()
CID 28475: Memory - illegal accesses (OVERRUN)
Fixed lcompat_sprintf() to use a size paramater
CID 28418, 28422: Error handling issues (CHECKED_RETURN)
Added function result cast to (void) to avoid warning
CID 23935, 28411, 28412: Memory - corruptions (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7181
PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
> $C
ffffff01809128c0 vpanic()
ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
ffffff3373370908)
ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
ffffff0180912b40 thread_start+8()
> ::status
debugging crash dump vmcore.2 (64-bit) from batfs0390
operating system: 5.11 joyent_20170911T171900Z (i86pc)
image uuid: (not set)
panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
owner=ffffff3c59b39480 thread=ffffff0180912c40
dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
zio tree:
pio
|--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
zio tree:
pio
|--- lio
|--- cio
The lio then grabs the parent's io_lock and removes itself.
zio tree:
pio
|--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c
Issue #5918Closes#7168
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.
Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').
By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.
Support the following commands and functionalities:
zfs set projectquota@project
zfs set projectobjquota@project
zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project
zfs projectspace
zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused
zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused
chattr +/-P
chattr -p project_id
lsattr -p
This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>
For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes#6290
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes#7011Closes#7054
A new interface was added to manipulate the version field of an
inode. Add a inode_set_iversion() wrapper for older kernels and
use the new interface when available.
The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7148
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.
=== What is implemented?
This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
the context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
=== How do we handle zfs.sync functions in open context?
When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...
arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg
if we run it in open context, we will get back the following error:
Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
[C]: in function 'destroy'
[string "channel program"]:3: in main chunk
=== What about testing?
We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.
OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15Closes#6558
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).
The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to amount
it.
With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have too. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.
A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/
OpenZFS-issue: https://www.illumos.org/issues/8604
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.
OpenZFS-issue: https://www.illumos.org/issues/8600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
ZFS channel programs should be able to perform a rollback.
OpenZFS-issue: https://www.illumos.org/issues/8592
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533
Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.
This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6864
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.
This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.
This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6821Closes#6864
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.
Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.
This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.
This patch also contains minor bug fixes and cleanups.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6845Closes#6864Closes#7052
This change was missed in 0735ecb334.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: András Korn <korn-github.com@elan.rulez.org>
Closes#7096
PROBLEM
=======
When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.
If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.
If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:
lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);
In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.
SOLUTION
========
This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.
Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issued`, leading the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.
With this change, `zil_lwb_write_issued` can acheive the semtantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.
Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.
Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`
OpenZFS-issue: https://www.illumos.org/issues/8997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9Closes#7084
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems. The proposed changes include:
* Added a new `zfs_deadman_failmode` module option which is
used to dynamically control the behavior of the deadman. It's
loosely modeled after, but independant from, the pool failmode
property. It can be set to wait, continue, or panic.
* wait - Wait for the "hung" I/O (default)
* continue - Attempt to recover from a "hung" I/O
* panic - Panic the system
* Added a new `zfs_deadman_ziotime_ms` module option which is
analogous to `zfs_deadman_synctime_ms` except instead of
applying to a pool TXG sync it applies to zio_wait(). A
default value of 300s is used to define a "hung" zio.
* The ztest deadman thread has been re-enabled by default,
aligned with the upstream OpenZFS code, and then extended
to terminate the process when it takes significantly longer
to complete than expected.
* The -G option was added to ztest to print the internal debug
log when a fatal error is encountered. This same option was
previously added to zdb in commit fa603f82. Update zloop.sh
to unconditionally pass -G to obtain additional debugging.
* The FM_EREPORT_ZFS_DELAY event which was previously posted
when the deadman detect a "hung" pool has been replaced by
a new dedicated FM_EREPORT_ZFS_DEADMAN event.
* The proposed recovery logic attempts to restart a "hung"
zio by calling zio_interrupt() on any outstanding leaf zios.
We may want to further restrict this to zios in either the
ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
Calling zio_interrupt() is expected to only be useful for
cases when an IO has been submitted to the physical device
but for some reasonable the completion callback hasn't been
called by the lower layers. This shouldn't be possible but
has been observed and may be caused by kernel/driver bugs.
* The 'zfs_deadman_synctime_ms' default value was reduced from
1000s to 600s.
* Depending on how ztest fails there may be no cache file to
move. This should not be considered fatal, collect the logs
which are available and carry on.
* Add deadman test cases for spa_deadman() and zio_wait().
* Increase default zfs_deadman_checktime_ms to 60s.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6999
usr/src/uts/common/sys/fs/zfs.h
Change ZPROP_INVAL and ZPROP_CONT from macros to enum values. Clang
and GCC both prefer to use unsigned ints to store enums. That was
causing tautological comparison warnings (and likely eliminating
error handling code at compile time) whenever a zfs_prop_t or
zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT. Making the
error flags be explicity enum values forces the enum types to be
signed.
ZPROP_INVAL was also compared against two different enum types. I
had to change its name to ZPOOL_PROP_INVAL whenever its compared to
a zpool_prop_t. There are still some places where ZPROP_INVAL or
ZPROP_CONT is compared to a plain int, in code that doesn't know
whether the int is storing a zfs_prop_t or a zpool_prop_t.
usr/src/uts/common/fs/zfs/spa.c
s/ZPROP_INVAL/ZPOOL_PROP_INVAL/
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74Closes#7061
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.
OpenZFS-issue: https://www.illumos.org/issues/8959
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eeaCloses#7049
For ztest, which is solely for testing, using a pseudo random
is entirely reasonable. Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7017Closes#7036
As part of the refactoring of ab9f4b0b82,
several uint64_t-s and uint8_t-s were changed to other types. This
caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM
machine. In absense of any strong motivation for the type changes, this
simply puts them back, modulo the changes accumulated for ABD.
Compile-tested on amd64 and run-tested on armhf.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes#6981Closes#7023
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.
This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6171Closes#6852Closes#6989
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
PROBLEM
=======
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race *is* present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
SOLUTION
========
The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.
Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.
Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.
Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8dCloses#6940
Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs.
The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.
This patch corrects this, it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.
A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes#6986
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.
This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6921Closes#6926
Suppress incorrect warnings from versions of objtool which are not
aware of x86 EVEX prefix instructions used for AVX512.
module/zfs/vdev_raidz_math_avx512bw.o: warning:
objtool: <func+offset>: can't find jump dest instruction at .text
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6928