Commit Graph

2707 Commits

Author SHA1 Message Date
Alexander Motin
7f9d9e6f39
Avoid vq_lock drop in vdev_queue_aggregate()
vq_lock is already too congested for two more operations per I/O.
Instead of dropping and reacquiring it inside vdev_queue_aggregate()
delegate the zio_vdev_io_bypass() and zio_execute() calls for parent
I/Os to callers, that drop the lock any way to execute the new I/O.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12297
2021-08-17 09:47:00 -06:00
Alexander Motin
e829a865bf
Use more atomics in refcounts
Use atomic_load_64() for zfs_refcount_count() to prevent torn reads
on 32-bit platforms.  On 64-bit ones it should not change anything.

When built with ZFS_DEBUG but running without tracking enabled use
atomics instead of mutexes same as for builds without ZFS_DEBUG.
Since rc_tracked can't change live we can check it without lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12420
2021-08-17 09:44:34 -06:00
Allan Jude
e945e8d7f4
Restore FreeBSD sysctl processing for arc.min and arc.max
Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max
to a disallowed value would return an error.
Since the switch, it instead only generates WARN_IF_TUNING_IGNORED

Keep the ability to set the sysctl's specifically to 0, even though
that is less than the minimum, because some tests depend on this.

Also lost, was the ability to set vfs.zfs.arc_max to a value less
than the default vfs.zfs.arc_min at boot time. Restore this as well.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #12161
2021-08-16 09:35:19 -06:00
Tony Nguyen
6bc61d22c4
Run arc_evict thread at higher priority
Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.

On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #12397
2021-08-10 11:36:26 -06:00
Alexander Motin
7eebcd2be6
Avoid small buffer copying on write
It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to
decide whether to reallocate/copy the buffer, because the answer is
OS-specific and depends on the buffer size.  Instead of that use
abd_size_alloc_linear(), moved into public header.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12425
2021-07-27 16:05:47 -07:00
Brian Behlendorf
b72611f0f6 Fix format specifier warnings
Commit 5dbf6c5a66 did not address these format specifier warnings
since they were introduced by an unrelated commit which had not
been rebased on 5dbf6c5a66 when merged.  Fix them.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12435
2021-07-27 09:50:00 -07:00
Jorgen Lundman
273730d5b5
macOS can also set va_type
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12357
2021-07-26 16:38:06 -07:00
Alexander Motin
dd3bda39cf
Add comment on metaslab_class_throttle_reserve() locking
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Issue #12314
Closes #12419
2021-07-26 16:30:20 -07:00
George Amanakis
ab8a8f0745
Fixes in persistent L2ARC
In l2arc_add_vdev() first decide whether the device is eligible for
L2ARC rebuild or whole device trim and then add it to the list of cache
devices. Otherwise l2arc_feed_thread() might already start writing on
the device invalidating previous content as l2ad_hand = l2ad_start.
However l2arc_rebuild_vdev() needs the device present in the cache
device list to figure out its l2arc_dev_t. Fix this by moving most of
l2arc_rebuild_vdev() in a new function l2arc_rebuild_dev() which does
not need to search in the cache device list.

In contrast to l2arc_add_vdev() we do not have to worry about
l2arc_feed_thread() invalidating previous content when onlining a
cache device. The device parameters (l2ad*) are not cleared when
offlining the device and writing new buffers will not invalidate
all previous content. In worst case only buffers that have not had
their log block written to the device will be lost.

Retire persist_l2arc_00{4,5,8} tests since they cover code already
covered by the remaining ones. Test persist_l2arc_006 is renamed to
persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005.

Fix a typo in persist_l2arc_004, and remove an assertion that is not
always true from l2arc_arcstats_pos. Also update an assertion in
persist_l2arc_005 and explain why in a comment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12365
2021-07-26 12:30:24 -07:00
наб
037af3e0d4 Remove NOTE(CONSTCOND) and note.h
These were mostly used to annotate do {} while(0)s

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12201
2021-07-26 12:07:53 -07:00
наб
2c69ba6444 Normalise /*FALLTHR{OUGH,U}*/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12201
2021-07-26 12:07:39 -07:00
наб
90f1c3c946 Prune /*NOTREACHED*/
This includes a simplification of mkbusy and format correctness in zhack
and ztest

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12201
2021-07-26 12:07:26 -07:00
наб
5dbf6c5a66 Replace /*PRINTFLIKEn*/ with attribute(printf)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #12201
2021-07-26 12:07:15 -07:00
Mark Johnston
1373709450 Initialize dn_next_type[] in the dnode constructor
It seems nothing ensures that this array is zeroed when a dnode is
freshly allocated, so in principle it retains the values from the
previous allocation.  In practice it seems to be the case that the
fields should end up zeroed, but we can zero the field anyway for
consistency.

This was found using KMSAN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
2021-07-26 11:53:47 -07:00
Mark Johnston
3a185275a0 Zero pad bytes following TX_WRITE log data
When logging a TX_WRITE record in the case where file data has to be
copied from the DMU, we pad the log record size to a multiple of 8
bytes.  In this case, any padding bytes should be zeroed, otherwise the
contents of uninitialized memory are written to the ZIL.

This was found using KMSAN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
2021-07-26 11:53:47 -07:00
Mark Johnston
58714c2817 Zero pad bytes when allocating a ZIL record
When allocating a record, we round up the allocation size to a multiple
of 8.  In this case, any padding bytes should be zeroed, otherwise the
contents of uninitialized memory are written to the ZIL.

This was found using KMSAN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
2021-07-26 11:53:47 -07:00
Mark Johnston
03363b2f86 Initialize all fields in zfs_log_xvattr()
When logging TX_SETATTR, we could otherwise fail to initialize part of
the corresponding ZIL record depending on which fields are present in
the xvattr.  Initialize the creation time and the AV scan timestamp to
zero so that uninitialized bytes are not written to the ZIL.

This was found using KMSAN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
2021-07-26 11:53:47 -07:00
Mark Johnston
da27b8bc7f Initialize "autoreplace" in spa_ld_get_props()
spa_prop_find() may fail to find the specified property, in which case
it suppresses ENOENT from zap_lookup().  In this case, the return value
is left uninitialized, so spa_autoreplace was being initialized using an
uninitialized stack variable.

This was found using KMSAN.  It appears to be a regression from commit
9eb7b46ed0, which removed the initialization of "autoreplace" from the
definition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12383
2021-07-26 11:53:47 -07:00
Rich Ercolani
f1ca7999bb
Fix unfortunate NULL in spa_update_dspace
After 1325434b, we can in certain circumstances end up calling
spa_update_dspace with vd->vdev_mg NULL, which ends poorly during
vdev removal.

So let's not do that further space adjustment when we can't.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12380 
Closes #12428
2021-07-26 10:51:30 -07:00
Alexander Motin
1b50749ce9
Optimize allocation throttling
Remove mc_lock use from metaslab_class_throttle_*().  The math there
is based on refcounts and so atomic, so the only race possible there
is between zfs_refcount_count() and zfs_refcount_add().  But in most
cases metaslab_class_throttle_reserve() is called with the allocator
lock held, which covers the race.  In cases where the lock is not
held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we
do not use zfs_refcount_count().  And even if we assume some other
non-existing scenario, the worst that may happen from this race is
few more I/Os get to allocation earlier, that is not a problem.

Move locks and data of different allocators into different cache
lines to avoid false sharing.  Group spa_alloc_* arrays together
into single array of aligned struct spa_alloc spa_allocs.  Align
struct metaslab_class_allocator.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12314
2021-07-21 06:40:36 -06:00
Kevin Jin
a7bd20e309
Add Module Parameter Regarding Log Size Limit
* Add Module Parameters Regarding Log Size Limit

zfs_wrlog_data_max
The upper limit of TX_WRITE log data. Once it is reached,
write operation is blocked, until log data is cleared out
after txg sync. It only counts TX_WRITE log with WR_COPIED
or WR_NEED_COPY.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #12284
2021-07-20 09:40:24 -06:00
Alexander Motin
8172df643b
Minor ARC optimizations
Remove unneeded global, practically constant, state pointer variables
(arc_anon, arc_mru, etc.), replacing them with macros of real state
variables addresses (&ARC_anon, &ARC_mru, etc.). 

Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special
handling in inner loop of ARC reclamation.  Respectively change bytes
argument of arc_evict_state() from int64_t to uint64_t.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12348
2021-07-20 08:13:21 -06:00
Jorgen Lundman
e04210035e
dmu_redact.c does not call bqueue_destroy
Ensure all calls to bqueue_init() has a corresponding call to bqueue_destroy()

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12118
2021-07-20 08:08:45 -06:00
Alexander
23c13c7e80
A few fixes of callback typecasting (for the upcoming ClangCFI)
* zio: avoid callback typecasting
* zil: avoid zil_itxg_clean() callback typecasting
* zpl: decouple zpl_readpage() into two separate callbacks
* nvpair: explicitly declare callbacks for xdr_array()
* linux/zfs_nvops: don't use external iput() as a callback
* zcp_synctask: don't use fnvlist_free() as a callback
* zvol: don't use ops->zv_free() as a callback for taskq_dispatch()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #12260
2021-07-20 08:03:33 -06:00
Ryan Moeller
de12cd2511
Remove unused fields from zvol_task_t
We don't use or need the pool name or value source in the zvol tasks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #12361
2021-07-19 10:02:35 -06:00
Alexander Motin
c1b5869bab
Introduce dsl_dir_diduse_transfer_space()
Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time
is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling
them one after another is a waste of time and even more contention.
Doing that twice for each rewritten block within dbuf_write_done()
via dsl_dataset_block_kill() and dsl_dataset_block_born() created one
of the biggest CPU overheads in case of small blocks rewrite.

dsl_dir_diduse_transfer_space() combines functionality of these two
functions for cases where it is needed, but without double overhead,
practically for the cost of dsl_dir_diduse_space() or even cheaper.

While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space()
and dsl_dir_transfer_space().  It seems Clang detects some aliasing
there, repeating dd->dd_dbuf->db_data dereference multiple times,
increasing dd_lock scope and contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Author: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12300
2021-07-16 13:39:24 -06:00
Rich Ercolani
1325434b2d
Tinker with slop space accounting with dedup
* Tinker with slop space accounting with dedup

Do not include the deduplicated space usage in the slop space
reservation, it leads to surprising outcomes.

* Update spa_dedup_dspace sometimes

Sometimes, we get into spa_get_slop_space() with
spa_dedup_dspace=~0ULL, AKA "unset", while spa_dspace is correctly set.

So call the code to update it before we use it if we hit that case.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12271
2021-07-13 09:47:57 -06:00
Alexander Motin
f7de776da2
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 09:41:59 -06:00
George Wilson
958826be7a
file reference counts can get corrupted
Callers of zfs_file_get and zfs_file_put can corrupt the reference
counts for the file structure resulting in a panic or a soft lockup.
When zfs send/recv runs, it will add a reference count to the
open file, and begin to send or recv the stream. If the file descriptor
is closed, then when dmu_recv_stream() or dmu_send() return we will
call zfs_file_put to remove the reference we placed on the file
structure. Unfortunately, because zfs_file_put() uses the file
descriptor to lookup the file structure, it may end up finding that
the file descriptor table no longer contains the file struct, thus
leaking the file structure. Or it might end up finding a file
descriptor for a different file and blindly updating its reference
counts. Other failure modes probably exists.

This change reworks the zfs_file_[get|put] interface to not rely
on the file descriptor but instead pass the zfs_file_t pointer around.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-76119
Closes #12299
2021-07-10 19:00:37 -06:00
Alexander Motin
97752ba22a
Move gethrtime() calls out of vdev queue lock
This dramatically reduces the lock contention on systems with slower
(non-TSC) timecounters.  With TSC the difference is minimal, but since
this lock is pretty congested, any improvement counts.  Plus I don't
see any reason to do it under the lock other than the latency of the
lock itself, which this change actually reduces.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12281
2021-07-06 14:38:00 -07:00
Alexander Motin
490c845efe
Compact dbuf/buf hashes and lock arrays
With default dbuf cache size of 1/32 of ARC, it makes no sense to have
hash table of the same size (or even bigger on Linux).  Reduce it to
1/8 of ARC's one, still leaving some slack, assuming higher I/O rate
via dbuf cache than via ARC.

Remove padding from ARC hash locks array.  The idea behind padding
is to avoid false sharing between locks.  It would have sense if
there would be a limited number of very busy locks.  But since we
have no limit on the number, using the same memory for more locks we
can achieve even lower lock contention with the same false sharing,
or we can use less memory for the same contention level.

Reduce number of hash locks from 8192 to 2048.  The number is still
big enough to not cause contention, but reduced memory size improves
cache hit rate for mutex_tryenter() in ARC eviction thread, saving
about 1% of the thread time.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12289
2021-07-01 09:30:31 -06:00
Jorgen Lundman
c6d1112bf4
Fix abd leak, kmem_free correct size of abd_t
Fix a leak of abd_t that manifested mostly when using
raidzN with at least as many columns as N (e.g. a
four-disk raidz2 but not a three-disk raidz2).
Sufficiently heavy raidz use would eventually run a system
out of memory.

Additionally:

* Switch abd_cache arena to FIRSTFIT, which empirically
improves perofrmance.

* Make abd_chunk_cache more performant and debuggable.

* Allocate the abd_zero_buf from abd_chunk_cache rather
than the heap.

* Don't try to reap non-existent qcaches in abd_cache arena.

* KM_PUSHPAGE->KM_SLEEP when allocating chunks from their
own arena

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: Sean Doran <smd@use.net>
Closes #12295
2021-07-01 09:28:15 -06:00
Jorgen Lundman
eca174527e
Upstream: dmu_zfetch_stream_fini leaks refcount
dmu_zfetch_stream_fini() is missing calls to destroy the refcounts,
leaking them and the mutex inside.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12294
2021-07-01 09:22:16 -06:00
Kevin Jin
50e09eddd0
Optimize txg_kick() process (#12274)
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().

Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes #12274
2021-07-01 09:20:27 -06:00
Alexander Motin
42afb12da7
Remove refcount from spa_config_*()
The only reason for spa_config_*() to use refcount instead of simple
non-atomic (thanks to scl_lock) variable for scl_count is tracking,
hard disabled for the last 8 years.  Switch to simple int scl_count
reduces the lock hold time by avoiding atomic, plus makes structure
fit into single cache line, reducing the locks contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12287
2021-07-01 09:16:54 -06:00
Alexander
6a19dea7f6
module/zfs: simplify ddt_stat_add() loop
LLVM's Polly (ISL to be precise) is unhappy with the loop from
ddt_stat_add():

  CC [M]  fs/zfs/zfs/ddt.o
../lib/External/isl/isl_schedule_node.c:2470: cannot insert node
between set or sequence node and its filter children

(building with the custom patch which adds Polly support to Kbuild)

The mentioned loop is rather suboptimal. All that we need is to just
treat ddt_stat_t as an array of u64 and perform 1:1 addition or
substraction. This can be done in simpler for-loop with the
determined index and bounds. Compiler will expand d_end - d into
a number of ddt_stat_t fields at compile time.
This prevents Polly from failing on this file.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes #12253
2021-06-29 08:26:11 -06:00
Alexander Motin
5b7053a9a5
Avoid 64bit division in multilist index functions
The number of sublists in a multilist is relatively small. We dont need
64 bits to calculate an index. 32 bits is sufficient and makes the
code more efficient.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> 
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12288
2021-06-29 06:59:14 -06:00
Alexander Motin
5e2c8338bf
Help compiller optimize out abd_verify()
While abd_verify() does nothing when built without debug, compiler
can't optimize it out by itself due to calls to external list_*()
and abd_verify_scatter().  This commit makes it explicit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12280
2021-06-25 16:38:31 -07:00
Brian Behlendorf
88a4833039
Update cache file when setting compatibility property
Unlike most other properties the 'compatibility' property is stored
in the pool config object and not the DMU_OT_POOL_PROPS object.

This had the advantage that the compatibility information is available
without needing to fully import the pool (it can be read with zdb).
However, this means we need to make sure to update both the copy of
the config in the MOS and the cache file.  This wasn't being done.

This commit adds a call to spa_async_request() to ensure the copy of
the config in the cache file gets updated as well as the one stored
in the pool.  This same change is made for the 'comment' property
which suffers from the same inconsistency.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Colm Buckley <colm@tuatha.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12261 
Closes #12276
2021-06-24 14:30:02 -07:00
jumbi77
86f5e0bbce
zfs_metaslab_mem_limit should be 25 instead of 75
According to current zfs man page zfs_metaslab_mem_limit should be
25 instead of 75.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: jumbi77@users.noreply.github.com
Closes #12273
2021-06-24 10:02:54 -07:00
Rich Ercolani
8e739b2c9f
Annotated dprintf as printf-like
ZFS loves using %llu for uint64_t, but that requires a cast to not 
be noisy - which is even done in many, though not all, places.
Also a couple places used %u for uint64_t, which were promoted
to %llu. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12233
2021-06-22 21:53:45 -07:00
Antonio Russo
a81b812495
Revert Consolidate arc_buf allocation checks
This reverts commit 13fac09868.

Per the discussion in #11531, the reverted commit---which intended only
to be a cleanup commit---introduced a subtle, unintended change in
behavior.

Care was taken to partially revert and then reapply 10b3c7f5e4
which would otherwise have caused a conflict.  These changes were
squashed in to this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Suggested-by: @chrisrd
Suggested-by: robn@despairlabs.com
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11531 
Closes #12227
2021-06-22 21:39:15 -07:00
Alexander Motin
29274c9f6d
Optimize small random numbers generation
In all places except two spa_get_random() is used for small values,
and the consumers do not require well seeded high quality values.
Switch those two exceptions directly to random_get_pseudo_bytes()
and optimize spa_get_random(), renaming it to random_in_range(),
since it is not related to SPA or ZFS in general.

On FreeBSD directly map random_in_range() to new prng32_bounded() KPI
added in FreeBSD 13.  On Linux and in user-space just reduce the type
used to uint32_t to avoid more expensive 64bit division.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12183
2021-06-22 17:35:23 -06:00
Alexander Motin
c4c162c1e8
Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172)
wmsum was designed exactly for cases like these with many updates
and rare reads.  It allows to completely avoid atomic operations on
congested global variables.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12172
2021-06-16 18:19:34 -06:00
George Amanakis
9ffcaa370a
Avoid deadlock when removing L2ARC devices under I/O
In case we have I/O and try to remove an L2ARC device a deadlock might
occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV
to be dropped while holding the hash_lock. However, spa_l2cache_load()
holds SCL_ALL and waits for the hash_lock in l2arc_evict().

Fix this by moving zfs_blkptr_verify() to the top top arc_read() before
the hash_lock is taken. Verify the block pointer and return a checksum
error if damaged rather than halting the system, by using
BLK_VERIFY_LOG instead of BLK_VERIFY_HALT.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12054
2021-06-16 18:17:42 -06:00
Matthew Ahrens
069bf406b4
vdev_draid_min_asize() ignores reserved space
vdev_draid_min_asize() returns the minimum size of a child vdev.  This
is used when determining if a disk is big enough to replace a child.
It's also used by zdb to determine how big of a child to make to test
replacement.

vdev_draid_min_asize() says that the child’s asize has to be at least
1/Nth of the entire draid’s asize, which is the same logic as raidz.
However, this contradicts the code in vdev_draid_open(), which
calculates the draid’s asize based on a reduced child size:

  An additional 32MB of scratch space is reserved at the end of each
  child for use by the dRAID expansion feature

So the problem is that you can replace a draid disk with one that’s
vdev_draid_min_asize(), but it actually needs to be larger to accommodate
the additional 32MB.  The replacement is allowed and everything works at
first (since the reserved space is at the end, and we don’t try to use
it yet), but when you try to close and reopen the pool,
vdev_draid_open() calculates a smaller asize for the draid, because of
the smaller leaf, which is not allowed.

I think the confusion is that vdev_draid_min_asize() is correctly
returning the amount of required *allocatable* space in a leaf, but the
actual *size* of the leaf needs to be at least 32MB more than that.
ztest_vdev_attach_detach() assumes that it can attach that size of
device, and it actually can (the kernel/libzpool accepts it), but it
then later causes zdb to not be able to open the pool.

This commit changes vdev_draid_min_asize() to return the required size
of the leaf, not the size that draid will make available to the metaslab
allocator.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11459
Closes #12221
2021-06-13 10:48:53 -07:00
Alexander Motin
ffdf019cb3
Re-embed multilist_t storage
This commit partially reverts changes to multilists in PR 7968
(multi-threaded spa-sync()) and adds some cache line alignments to
separate read-only multilists and heavily modified refcount's to different
cache lines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-by: iXsystems, Inc.
Closes #12158
2021-06-10 10:42:31 -06:00
Alexander Motin
371f88d96f
Remove pool io kstats (#12212)
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.

From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.

This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12212
2021-06-10 08:27:33 -07:00
Alan Somers
75b4cbf625
libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat
`getfsstat(2)` is used to retrieve the list of mounted file systems,
which libzfs uses when fetching properties like mountpoint, atime,
setuid, etc.  The `mode` parameter may be `MNT_NOWAIT`, which uses
information in the VFS's cache, or `MNT_WAIT`, which effectively does a
`statfs` on every single mounted file system in order to fetch the most
up-to-date information.  As far as I can tell, the only fields that
libzfs cares about are the filesystem's name, mountpoint, fstypename,
and mount flags.  Those things are always updated on mount and unmount,
so they will always be accurate in the VFS's mount cache except in two
circumstances:

1) When a file system is busy unmounting
2) When a ZFS file system changes the value of a mount-overridable
   property like atime or setuid, but doesn't remount the file system.
   Right now that only happens when the property is changed by an
   unprivileged user who has delegated authority to change the property
   but not to mount the dataset.  But perhaps libzfs could choose to do
   it for other reasons in the future.

Switching to `MNT_NOWAIT` will greatly improve speed with no downside,
as long as we explicitly update the mount cache whenever we change a
mount-overridable property.

For comparison, Illumos gets this information using the native
`getmntany` and `getmntent` functions, which also use cached
information.  The illumos function that would refresh the cache,
`resetmnttab`, is never called by libzfs.

And on GNU/Linux, `getmntany` and `getmntent` don't even communicate
with the kernel directly.  They simply parse the file they are given,
which is usually /etc/mtab or /proc/mounts.  Perhaps the implementation
of /proc/mounts is synchronous, ala MNT_WAIT; I don't know.

Sponsored-by:	Axcient
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes: #12091
2021-06-08 07:36:43 -06:00
наб
f423411c44 module/zfs: vdev_removal: spa_vdev_remove_thread: remove unused variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12187
2021-06-07 20:58:56 -07:00
наб
6bab6ee838 module/zfs: vdev_indirect: vdev_indirect_repair: remove unused variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12187
2021-06-07 20:58:51 -07:00
наб
f70f3e2fbc module/zfs: dbuf: dbuf_read_impl: remove unused variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12187
2021-06-07 20:58:47 -07:00
наб
f719d3b160 module/zfs: arc: arc_hdr_realloc_crypt: remove unused variables
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12187
2021-06-07 20:58:42 -07:00
Serapheim Dimitropoulos
86b5f4c121
Livelist logic should handle dedup blkptrs
Update the logic to handle the dedup-case of consecutive
FREEs in the livelist code. The logic still ensures that
all the FREE entries are matched up with a respective
ALLOC by keeping a refcount for each FREE blkptr that we
encounter and ensuring that this refcount gets to zero
by the time we are done processing the livelist.

zdb -y no longer panics when encountering double frees

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11480
Closes #12177
2021-06-07 13:09:07 -06:00
Alexander Motin
ea400129c3
More aggsum optimizations
- Avoid atomic_add() when updating as_lower_bound/as_upper_bound.
Previous code was excessively strong on 64bit systems while not
strong enough on 32bit ones.  Instead introduce and use real
atomic_load() and atomic_store() operations, just an assignments
on 64bit machines, but using proper atomics on 32bit ones to avoid
torn reads/writes.

 - Reduce number of buckets on large systems.  Extra buckets not as
much improve add speed, as hurt reads.  Unlike wmsum for aggsum
reads are still important.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12145
2021-06-07 09:02:47 -07:00
Alexander Motin
86706441a8
Introduce write-mostly sums
wmsum counters are a reduced version of aggsum counters, optimized for
write-mostly scenarios.  They do not provide optimized read functions,
but instead allow much cheaper add function.  The primary usage is
infrequently read statistic counters, not requiring exact precision.

The Linux implementation is directly mapped into percpu_counter KPI.
The FreeBSD implementation is directly mapped into counter(9) KPI.
In user-space due to lack of better implementation mapped to aggsum.

Unfortunately neither Linux percpu_counter nor FreeBSD counter(9)
provide sufficient functionality to completelly replace aggsum, so
it still remains to be used for several hot counters.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12114
2021-05-27 14:27:29 -06:00
Alexander Motin
2041d6eecd
Improve scrub maxinflight_bytes math.
Previously, ZFS scaled maxinflight_bytes based on total number of
disks in the pool.  A 3-wide mirror was receiving a queue depth of 3
disks, which it should not, since it reads from all the disks inside.
For wide raidz the situation was slightly better, but still a 3-wide
raidz1 received a depth of 3 disks instead of 2.

The new code counts only unique data disks, i.e. 1 disk for mirrors
and non-parity disks for raidz/draid.  For draid the math is still
imperfect, since vdev_get_nparity() returns number of parity disks
per group, not per vdev, but still some better than it was.

This should slightly reduce scrub influence on payload for some pool
topologies by avoiding excessive queuing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
Closing #12046
2021-05-27 10:11:39 -06:00
vermavipinkumar
dce1bf99ec
Propagate vdev state due to invalid label corruption
Propagate vdev child state to parents on invalid label
Add VDEV_AUX_BAD_LABEL to print_import_config()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Signed-off-by: Vipin Kumar Verma <vipin.verma@hpe.com>
Closes #12088
2021-05-25 12:32:07 -06:00
Brian Behlendorf
8fb577ae6d
Fix dRAID sequential resilver silent damage handling
This change addresses two distinct scenarios which are possible
when performing a sequential resilver to a dRAID pool with vdevs
that contain silent unknown damage. Which in this circumstance
took the form of the devices being intentionally overwritten with
zeros. However, it could also result from a device returning incorrect
data while a sequential resilver was in progress.

Scenario 1) A sequential resilver is performed while all of the
dRAID vdevs are ONLINE and there is silent damage present on the
vdev being resilvered. In this case, nothing will be repaired
by vdev_raidz_io_done_reconstruct_known_missing() because
rc->rc_error isn't set on any of the raid columns. To address
this vdev_draid_io_start_read() has been updated to always mark
the resilvering column as ESTALE for sequential resilver IO.

Scenario 2) Multiple columns contain silent damage for the same
block and a sequential resilver is performed. In this case it's
impossible to generate the correct data from parity unless all of
the damaged columns are being sequentially resilvered (and thus
only good data is used to generate parity). This is as expected
and there's nothing which can be done about it. However, we need
to be careful not to make to situation worse. Since we can't
verify the data is actually good without a checksum, we must
only repair the devices which are being sequentially resilvered.
Otherwise, an incorrect repair to a device which previously
contained good data could effectively lock in the damage and
make reconstruction impossible. A check for this was added to
vdev_raidz_io_done_verified() along with a new test case.

Lastly, this change updates the redundancy_draid_spare1 and
redundancy_draid_spare3 test cases to be more representative
of normal dRAID replacement operation.  Specifically, what we
care about is that the scrub run after a sequential resilver
does not find additional blocks which need repair.  This would
indicate the sequential resilver failed to rebuild a section of
one of the devices. Note also the tests were switched to using
the verify_pool() function which still checks for checksum errors.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12061
2021-05-20 15:05:26 -07:00
Alexander Motin
7457b024ba
Scale worker threads and taskqs with number of CPUs
While use of dynamic taskqs allows to reduce number of idle threads,
hardcoded 8 taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reorder, etc, while
providing no real locking benefits, just not needed there.

On another side, 12*8 worker threads per kind are able to overload
almost any system nowadays.  For example, pool of several fast SSDs
with SHA256 checksum makes system barely responsive during scrub, or
with dedup enabled barely responsive during large file deletion.

To address both problems this patch introduces ZTI_SCALE macro, alike
to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs,
to be used in places where lock scalability is needed, while request
ordering is not so much.  The code is made to create new taskq for
~6 worker threads (less for small systems, but more for very large)
up to 80% of CPU cores (previous 75% was not good for rounding down).
Both number of threads and threads per taskq are now tunable in case
somebody really wants to use all of system power for ZFS.

While obviously some benchmarks show small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show dramatic latency reduction and much more smooth user-
space operation in case of high CPU usage by ZFS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11966
2021-05-14 09:13:53 -07:00
Paul Zuchowski
fce29d6aa4
Fix dmu_recv_stream test for resumable
Use dsl_dataset_has_resume_receive_state()
not dsl_dataset_is_zapified() to check if
stream is resumable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #12034
2021-05-13 21:46:14 -07:00
Brian Behlendorf
6217656da3
Revert "Fix raw sends on encrypted datasets when copying back snapshots"
Commit d1d4769 takes into account the encryption key version to
decide if the local_mac could be zeroed out. However, this could lead
to failure mounting encrypted datasets created with intermediate
versions of ZFS encryption available in master between major releases.
In order to prevent this situation revert d1d4769 pending a more
comprehensive fix which addresses the mount failure case.

Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11294
Issue #12025
Issue #12300
Closes #12033
2021-05-13 10:00:17 -07:00
наб
38c6d6cedd
module/zfs: remove zfs_zevent_console and zfs_zevent_cols
zfs_zevent_console committed multiple printk()s per line without
properly continuing them ‒ a single event could easily be fragmented
across over thirty lines, making it useless for direct application

zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console

The niche this was supposed to fill can be better served by something
akin to the all-syslog ZEDLET

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #7082 
Closes #11996
2021-05-10 11:00:15 -07:00
Brian Behlendorf
93c8e91fe7
Fix dRAID self-healing short columns
When dRAID performs a normal read operation only the data columns
in the raid map are read from disk.  This is enough information to
calculate the checksum, verify it, and return the needed data to the
application.  It's only in the event of a checksum failure that the
additional parity and any empty columns must be read since they are
required for parity reconstruction.

Reading these additional columns is handled by vdev_raidz_read_all()
which calls vdev_draid_map_alloc_empty() to expand the raid_map_t
and submit IOs for the missing columns.  This all works correctly,
but it fails to account for any "short" columns.  These are data
columns which are padded with a empty skip sector at the end.
Since that empty sector is not needed for a normal read it's not
read when columns is first read from disk.  However, like the parity
and empty columns the skip sector is needed to perform reconstruction.

The fix is to mark any "short" columns as never being read by clearing
the rc_tried flag when expanding the raid_map_t.  This will cause
the entire column to re-read from disk in the event of a checksum
failure allowing the self-healing functionality to repair the block.

Note that this only effects the self-healing feature because when
scrubbing a pool the parity, data, and empty columns are all read
initially to verify their contents.  Furthermore, only blocks which
contain "short" columns would be effected, and only when the memory
backing the skip sector wasn't already zeroed out.

This change extends the existing redundancy_raidz.ksh test case to
verify self-healing (as well as resilver and scrub).  Then applies
the same test case to dRAID with a slightly modified version of
the test script called redundancy_draid.ksh.  The unused variable
combrec was also removed from both test cases.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12010
2021-05-08 08:57:25 -07:00
Alexander Motin
4fb9e5638b
Simplify/fix dnode_move() for dn_zfetch
Previous code tried to keep prefetch streams while moving dnode.  But
it was at least not updating per-stream zs_fetchback pointers, causing
use-after-free on next access.  Instead of that I see much easier and
cleaner to just drop old prefetch state and start new from scratch.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11936
Closes #11998
2021-05-07 15:07:03 -07:00
Nathaniel Wesley Filardo
056a658dee
vdev_mirror: don't scrub/resilver devices that can't be read
This ensures that we don't accumulate checksum errors against offline or
unavailable devices but, more importantly, means that we don't
needlessly create DTL entries for offline devices that are already
up-to-date.

Consider a 3-way mirror, with disk A always online (and so always with
an empty DTL) and B and C only occasionally online.  When A & B resilver
with C offline, B's DTL will effectively be appended to C's due to these
spurious ZIOs even as the resilver empties B's DTL:

  * These ZIOs land in vdev_mirror_scrub_done() and flag an error

  * That flagged error causes vdev_mirror_io_done() to see
    unexpected_errors, so it issues a ZIO_TYPE_WRITE repair ZIO, which
    inherits ZIO_FLAG_SCAN_THREAD because zio_vdev_child_io() includes
    that flag in ZIO_VDEV_CHILD_FLAGS.

  * That ZIO fails, too, and eventually zio_done() gets its hands on it
    and calls vdev_stat_update().

  * vdev_stat_update() sees the error and this zio...

    * is not speculative,
    * is not due to EIO (but rather ENXIO, since the device is closed)
    * has an ->io_vd != NULL (specifically, the offline leaf device)
    * is a write
    * is for a txg != 0 (but rather the read block's physical birth txg)
    * has ZIO_FLAG_SCAN_THREAD asserted

  * So: vdev_stat_update() calls vdev_dtl_dirty() on the offline vdev.

Then, when A & C resilver with B offline, that story gets replayed and
C's DTL will be appended to B's.

In fact, one does not need this permanently-broken-mirror scenario to
induce badness: breaking a mirror with no DTLs and then scrubbing will
create DTLs for all offline devices.  These DTLs will persist until the
entire mirror is reassembled for the duration of the *resilver*, which,
incidentally, will not consider the devices with good data to be sources
of good data in the case of a read failure.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Wesley Filardo <nwfilardo@gmail.com>
Closes #11930
2021-04-27 17:48:11 -07:00
Mateusz Guzik
309c32c954
Combine zio caches if possible
This deduplicates 2 sets of caches which use the same allocation size.

Memory savings fluctuate a lot, one sample result is FreeBSD running
"make buildworld" saving ~180MB RAM in reduced page count associated
with zio caches.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11877
2021-04-17 12:36:04 -07:00
Paul Zuchowski
f2286383d0
Fix crash in zio_done error reporting
Fix NULL pointer dereference when reporting
checksum error for gang block in zio_done.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #11872
Closes #11896
2021-04-16 11:00:53 -07:00
наб
375bdb2b20 module/zfs/zvol.c: purge unused zvol_volmode_cb_arg
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #11879
2021-04-15 14:55:37 -07:00
Jitendra Patidar
08795ab8d3
ZFS traverse_visitbp optimization to limit prefetch
Traversal code, traverse_visitbp() does visit blocks recursively.
Indirect (Non L0) Block of size 128k could contain, 1024 block pointers
of 128 bytes. In case of full traverse OR incremental traverse, where
all blocks were modified, it could traverse large number of blocks
pointed by indirect. Traversal code does issue prefetch of blocks
traversed below indirect. This could result into large number of
async reads queued on vdev queue. So, account for prefetch issued for
blocks pointed by indirect and limit max prefetch in one go.

Module Param:
zfs_traverse_indirect_prefetch_limit: Limit of prefetch while traversing
an indirect block.

Local counters:
prefetched: Local counter to account for number prefetch done.
pidx: Index for which next prefetch to be issued.
ptidx: Index at which next prefetch to be triggered.

Keep "ptidx" somewhere in the middle of blocks prefetched, so that
blocks prefetch read gets the enough time window before their demand
read is issued.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #11802 
Closes #11803
2021-04-15 13:49:27 -07:00
Brian Behlendorf
888700bc6b
ZTS: fix removal_condense_export test case
It's been observed in the CI that the required 25% of obsolete bytes
in the mapping can be to high a threshold for this test resulting in
condensing never being triggered and a test failure.  To prevent these
failures make the existing zfs_condense_indirect_obsolete_pct tuning
available so the obsolete percentage can be reduced from 25% to 5%
during this test.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11869
2021-04-11 21:49:13 -07:00
pstef
458f82319a
Balance parentheses in parameter descriptions
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Piotr Paweł Stefaniak <pstef@freebsd.org>
Closes #11882
2021-04-11 16:35:07 -07:00
Ryan Moeller
a631283b74 Move zfsdev_state_{init,destroy} to common code
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11833
2021-04-08 21:17:43 -07:00
Ryan Moeller
1dff545278 Eliminate zfsdev_get_state_impl
After 3937ab20f zfsdev_get_state_impl can become zfsdev_get_state.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11833
2021-04-08 21:17:18 -07:00
Brian Behlendorf
600a1dc54c
Use dsl_scan_setup_check() to setup a scrub
When a rebuild completes it will automatically schedule a follow up
scrub to verify all of the block checksums.  Before setting up the
scrub execute the counterpart dsl_scan_setup_check() function to
confirm the scrub can be started.  Prior to this change we'd only
check vdev_rebuild_active() which isn't as comprehensive, and using
the check function keeps all of this logic in one place.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11849
2021-04-08 14:33:15 -07:00
Ryan Moeller
e778b0485b
Ratelimit deadman zevents as with delay zevents
Just as delay zevents can flood the zevent pipe when a vdev becomes
unresponsive, so do the deadman zevents.

Ratelimit deadman zevents according to the same tunable as for delay
zevents.

Enable deadman tests on FreeBSD and add a test for deadman event
ratelimiting. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11786
2021-04-07 16:23:57 -07:00
Andrea Gelmini
bf169e9f15 Fix various typos
Correct an assortment of typos throughout the code base.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11774
2021-04-02 18:52:15 -07:00
Ryan Moeller
032a213e2e Don't scale zfs_zevent_len_max by CPU count
The lower bound for this scaling to too low and the upper bound is too
high.  Use a fixed default length of 512 instead, which is a reasonable
value on any system.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11822
2021-04-01 08:45:04 -07:00
Ryan Moeller
3ba10f9a6a Atomically check and set dropped zevent count
ratelimit_dropped isn't protected by a lock and is expected to
be updated atomically.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11822
2021-04-01 08:43:01 -07:00
Matthew Ahrens
2b56a63457
Use a helper function to clarify gang block size
For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the
gang DVA including its children BP's.  The space allocated at each DVA's
vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`.

This commit makes this relationship more clear by using a helper
function, `vdev_gang_header_asize()`, for the space allocated at the
gang block's vdev/offset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11744
2021-03-26 11:19:35 -07:00
Andrea Gelmini
8a915ba1f6
Removed duplicated includes
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #11775
2021-03-22 12:34:58 -07:00
Alexander Motin
891568c990
Split dmu_zfetch() speculation and execution parts
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads same time on completion, that can significantly
reorder the requests, making the stream look like random.  But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.

This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run().  The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued.  It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.

For many years it was a big problem for storage servers, handling
deeper request queues from their clients, having to either serialize
consequential reads to make ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients.  Benefits of those ways
varied, but neither was perfect.  With this patch deeper queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to FreeBSD target show me much better throughput with almost
100% prefetcher hit rate, comparing to almost zero before.

While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock.  Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.

Delete prefetch streams when they reach ends of files.  It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.

Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache.  First cache miss immediately fires all the prefetch
that would be done for the stream by that time.  It saves some CPU
time if same files within DMU cache capacity are read over and over.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11652
2021-03-19 22:56:11 -07:00
Chunwei Chen
296a4a369b
Fix zfs_get_data access to files with wrong generation
If TX_WRITE is create on a file, and the file is later deleted and a new
directory is created on the same object id, it is possible that when
zil_commit happens, zfs_get_data will be called on the new directory.
This may result in panic as it tries to do range lock.

This patch fixes this issue by record the generation number during
zfs_log_write, so zfs_get_data can check if the object is valid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #10593
Closes #11682
2021-03-19 22:53:31 -07:00
Andrew
66e6d3f128
Fix regression in POSIX mode behavior
Commit 235a85657 introduced a regression in evaluation of POSIX modes
that require group DENY entries in the internal ZFS ACL. An example
of such a POSX mode is 007. When write_implies_delete_child is set,
then ACE_WRITE_DATA is added to `wanted_dirperms` in prior to calling
zfs_zaccess_common(). This occurs is zfs_zaccess_delete().

Unfortunately, when zfs_zaccess_aces_check hits this particular DENY
ACE, zfs_groupmember() is checked to determine whether access should be
denied, and since zfs_groupmember() always returns B_TRUE on Linux and
so this check is failed, resulting ultimately in EPERM being returned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #11760
2021-03-19 22:50:46 -07:00
Martin Matuška
cd5b812818
Allow setting bootfs property on pools with indirect vdevs
The FreeBSD boot loader relies on the bootfs property and is capable
of booting from removed (indirect) vdevs.

Reviewed-by Eric van Gyzen
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #11763
2021-03-19 22:46:43 -07:00
Serapheim Dimitropoulos
793c958f6f
Initialize metaslab range trees in metaslab_init
= Motivation

We've noticed several zloop crashes within Delphix generated
due to the following sequence of events:

- A device gets expanded and new metaslabas are allocated for
  it. These metaslabs go through `metaslab_init()` but haven't
  gone through `metaslab_sync_done()` yet. This meas that the
  only range tree that's actually set is the `ms_allocatable`.
  All the others are NULL.

- A vdev_initialization is issues and `vdev_initialize_thread`
  starts processing one of these new metaslabs of the expanded
  vdev.

- As part of `vdev_initialize_calculate_progress()` we call
  into `metaslab_load()` and `metaslab_load_impl()` which
  in turn tries to dereference the metaslabs trees that
  are still NULL and therefore we crash.

The same failure can come up from the `vdev_trim` code paths.

= This Patch

We considered the following solutions to deal with this issue:

[A] Add logic to `vdev_initialize/trim` to skip those new
    metaslabs. We decided against this as it would be good
    to avoid exposing this lower-level detail to higer-level
    operations.

[B] Have `metaslab_load_impl()` return early for new metaslabs
    and thus never touch those range_trees that are NULL at
    that time. This seemed more of a work-around for the bug
    and not a clear-cut solution.

[C] Refactor our logic so all metaslabs have their range_trees
    created at the time of their creatin in `metaslab_init()`.

In this patch we decided to go with [C] because:

(1) It doesn't expose more metaslab details to higher level
    operations such as vdev initialize and trim.

(2) The current behavior of creating the range trees lazily
    in `metaslab_sync_done()` is unnecessarily complicated.

(3) Always initializing the metaslab range_trees makes other
    parts of the codebase cleaner. For example, we used to
    use `ms_freed` as the reference value for knowing whether
    all the range_trees have been initialized. Now we no
    longer need to do that check in most places (and in the
    few that we do we use the `ms_new` boolean field now
    which is more readable).

= Side Changes

Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice
in `metasloab_load_impl()`. In this patch we remove the extraneous
assignment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11737
2021-03-19 22:36:02 -07:00
Matthew Ahrens
330c6c0523
Clean up RAIDZ/DRAID ereport code
The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs.  Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not.  In
other words, these are instances of silent data corruption.  The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).

Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.

RAIDZ/DRAID generate checksum errors via three code paths:

1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction.  This is handled in `raidz_reconstruct()`.  This is the
most common type of RAIDZ/DRAID checksum error.

2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost.  The zio fails and an error is
returned to the consumer (e.g. the read(2) system call).  This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged.  Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read.  (For DRAID,
typically only one "group" of children is involved in each io.)  This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.

3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happens to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child.  This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity.  This case
happens very rarely, since normally only metadata has ditto blocks.  If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.

The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:

1. It needs to run after the main raidz read logic has completed.  The
data RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.

2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data.  For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.

Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3.  These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11735
2021-03-19 16:22:10 -07:00
Matthew Ahrens
46df6e98aa
Remove unused rr_code
The `rr_code` field in `raidz_row_t` is unused.

This commit removes the field, as well as the code that's used to set
it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11736
2021-03-17 21:57:09 -07:00
Don Brady
dd0b5c8559
Reference_tracking_enable should be a module param
To make use of zfs_refcount_held tunable it should be a module 
parameter in open-zfs.  Also, since the macros will auto-generate OS 
specific tunables, removed the existing zfs_refcount_held reference 
in module/os/freebsd/zfs/sysctl_os.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11753
2021-03-16 14:56:17 -07:00
Mateusz Guzik
5ebe425a5b Macroify teardown lock handling
This will allow platforms to implement it as they see fit, in particular
in a different manner than rrm locks.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153
2021-03-12 15:51:39 -08:00
Ryan Moeller
35aa9dc6df
FreeBSD: Fix scope of deadman tunables
A few deadman tunables ended up in the wrong sysctl node.

Move them to vfs.zfs.deadman.*

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11715
2021-03-11 19:23:24 -08:00
Christian Schwarz
93e3658035
zvol: call zil_replaying() during replay
zil_replaying(zil, tx) has the side-effect of informing the ZIL that an
entry has been replayed in the (still open) tx.  The ZIL uses that
information to record the replay progress in the ZIL header when that
tx's txg syncs.

ZPL log entries are not idempotent and logically dependent and thus
calling zil_replaying() is necessary for correctness.

For ZVOLs the question of correctness is more nuanced: ZVOL logs only
TX_WRITE and TX_TRUNCATE, both of which are idempotent. Logical
dependencies between two records exist only if the write or discard
request had sync semantics or if the ranges affected by the records
overlap.

Thus, at a first glance, it would be correct to restart replay from
the beginning if we crash before replay completes. But this does not
address the following scenario:
Assume one log record per LWB.
The chain on disk is

    HDR -> 1:W(1, "A") -> 2:W(1, "B") -> 3:W(2, "X") -> 4:W(3, "Z")

where N:W(O, C) represents log entry number N which is a TX_WRITE of C
to offset A.
We replay 1, 2 and 3 in one txg, sync that txg, then crash.
Bit flips corrupt 2, 3, and 4.
We come up again and restart replay from the beginning because
we did not call zil_replaying() during replay.
We replay 1 again, then interpret 2's invalid checksum as the end
of the ZIL chain and call replay done.
The replayed zvol content is "AX".

If we had called zil_replaying() the HDR would have pointed to 3
and our resumed replay would not have replayed anything because
3 was corrupted, resulting in zvol content "BX".

If 3 logically depends on 2 then the replay corrupted the ZVOL_OBJ's
contents.

This patch adds the zil_replaying() calls to the replay functions.
Since the callbacks in the replay function need the zilog_t* pointer
so that they can call zil_replaying() we open the ZIL while
replaying in zvol_create_minor(). We also verify that replay has
been done when on-demand-opening the ZIL on the first modifying
bio.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11667
2021-03-07 09:49:58 -08:00
Ryan Moeller
4b2e20824b
Intentionally allow ZFS_READONLY in zfs_write
ZFS_READONLY represents the "DOS R/O" attribute.
When that flag is set, we should behave as if write access
were not granted by anything in the ACL.  In particular:
We _must_ allow writes after opening the file r/w, then
setting the DOS R/O attribute, and writing some more.
(Similar to how you can write after fchmod(fd, 0444).)

Restore these semantics which were lost on FreeBSD when refactoring
zfs_write.  To my knowledge Linux does not actually expose this flag,
but we'll need it to eventually so I've added the supporting checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11693
2021-03-07 09:31:52 -08:00
Jorgen Lundman
8a6d444825
Fix abd_get_offset_struct() may allocate new abd
Even when supplied with an abd to abd_get_offset_struct(), the call
to abd_get_offset_impl() can allocate a different abd. Ensure to
call abd_fini_struct() on the abd that is not used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #11683
2021-03-05 12:22:57 -08:00
nssrikanth
bedbc13daa
Cancel TRIM / initialize on FAULTED non-writeable vdevs
When a device which is actively trimming or initializing becomes
FAULTED, and therefore no longer writable, cancel the active
TRIM or initialization.  When the device is merely taken offline
with `zpool offline` then stop the operation but do not cancel it.
When the device is brought back online the operation will be
resumed if possible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Vipin Kumar Verma <vipin.verma@hpe.com>
Signed-off-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Closes #11588
2021-03-02 10:27:27 -08:00
Brian Behlendorf
8e43fa12c5
Fix vdev_rebuild_thread deadlock
The metaslab_disable() call may block waiting for a txg sync.
Therefore it's important that vdev_rebuild_thread release the
SCL_CONFIG read lock it is holding before this call.  Failure
to do so can result in the txg_sync thread getting blocked
waiting for this lock which results in a deadlock.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewd-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11647
2021-02-24 10:01:00 -08:00
Brian Behlendorf
75a089ed34
Fix overly broad locking in spa_vdev_config_exit()
Calling vdev_free() only requires the we acquire the spa config
SCL_STATE_ALL locks, not the SCL_ALL locks.  In particular, we need
need to avoid taking the SCL_CONFIG lock (included in SCL_ALL) as a
writer since this can lead to a deadlock.  The txg_sync_thread() may
block in spa_txg_history_init_io() when taking the SCL_CONFIG lock
as a reading when it detects there's a pending writer.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11585
2021-02-24 10:00:21 -08:00
Prakash Surya
f01eaed455
Add upper bound for slop space calculation
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #11023
2021-02-24 09:52:43 -08:00
Ryan Moeller
5156862960
Wrap bare EINVAL returns with SET_ERROR
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11636
2021-02-24 09:51:10 -08:00
fbynite
11f2e9a491
vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL
This prevents a panic after a SLOG add/removal on the root pool followed
by a zpool scrub.

When a SLOG is removed, a hole takes its place - the vdev_ops for a hole
is vdev_hole_ops, which defines the handler functions of vdev_op_hold
and vdev_op_rele as NULL.

This bug has been reported in illumos and FreeBSD, a different trigger
in the FreeBSD report though.

Credit for this patch goes to Patrick Mooney <pmooney@pfmooney.com>

Obtained from: illumos-gate commit: c65bd18728f34725
External-issue: https://www.illumos.org/issues/12981
External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes #11623
2021-02-20 20:19:20 -08:00
Brian Atkinson
c0801bf35a
Cleaning up uio headers
Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11622
2021-02-20 20:16:50 -08:00
Ryan Moeller
64e0fe14ff
Restore FreeBSD resource usage accounting
Add zfs_racct_* interfaces for platform-dependent read/write accounting.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11613
2021-02-19 22:34:33 -08:00
Don Brady
03e02e5b56
Checksum errors may not be counted
Fix regression seen in issue #11545 where checksum errors 
where not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11609
2021-02-19 22:33:15 -08:00
Colm
658fb8020f
Add "compatibility" property for zpool feature sets
Property to allow sets of features to be specified; for compatibility
with specific versions / releases / external systems. Influences
the behavior of 'zpool upgrade' and 'zpool create'. Initial man
page changes and test cases included.

Brief synopsis:

zpool create -o compatibility=off|legacy|file[,file...] pool vdev...

compatibility = off : disable compatibility mode (enable all features)
compatibility = legacy : request that no features be enabled
compatibility = file[,file...] : read features from specified files.
Only features present in *all* files will be enabled on the
resulting pool. Filenames may be absolute, or relative to
/etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc
checked first).

Only affects zpool create, zpool upgrade and zpool status.

ABI changes in libzfs:

* New function "zpool_load_compat" to load and parse compat sets.
* Add "zpool_compat_status_t" typedef for compatibility parse status.
* Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum
* Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum

An initial set of base compatibility sets are included in
cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is
modified to install these in $pkgdatadir/compatibility.d and to
create symbolic links to a reasonable set of aliases.

Reviewed-by: ericloewe
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #11468
2021-02-17 21:30:45 -08:00
khng300
fc273894d2
Rename zfs_inode_update to zfs_znode_update_vfs
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes #11580
2021-02-09 11:17:29 -08:00
Antonio Russo
f8ce8aed0c
Set file mode during zfs_write
3d40b65 refactored zfs_vnops.c, which shared much code verbatim between
Linux and BSD.  After a successful write, the suid/sgid bits are reset,
and the mode to be written is stored in newmode.  On Linux, this was
propagated to both the in-memory inode and znode, which is then updated
with sa_update.

3d40b65 accidentally removed the initialization of newmode, which
happened to occur on the same line as the inode update (which has been
moved out of the function).

The uninitialized newmode can be saved to disk, leading to a crash on
stat() of that file, in addition to a merely incorrect file mode.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11474 
Closes #11576
2021-02-08 09:15:05 -08:00
Christian Schwarz
84268b099b Document monotonicity of dmu_tx_assign() and txg_hold_open()
Expand the comments to make it clear exactly what is guaranteed
by dmu_tx_assign() and txg_hold_open().  Additionally, update
the comment which refers to txg_exit() when it should reference
txg_rele_to_sync().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11521
2021-02-02 10:11:37 -08:00
Matthew Ahrens
2d4bbd14fc
The abd child/parent relationship does not need to be tracked
ABD's currently track their parent/child relationship.  This applies to
`abd_get_offset()` and `abd_borrow_buf()`.  However, nothing depends on
knowing this relationship, it's only used for consistency checks to
verify that we are not destroying an ABD that's still in use.  When we
are creating/destroying ABD's frequently, the performance impact of
maintaining these data structures (in particular the atomic
increment/decrement operations) can be measurable.

This commit removes this verification code on production builds, but
keeps it when ZFS_DEBUG is set.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11535
2021-01-30 10:04:42 -08:00
Brian Atkinson
2993698eb3
Fixing gang ABD when adding another gang
I originally applied a fix in #11539 to fix a parent's child references
when a gang ABD is free'd. However, I did not take into account
abd_gang_add_gang(). We still need to make sure to update the child
references in this function as well. In order to resolve this I removed
decreasing the gang ABD's size in abd_free_gang() as well as moved back
the original placeent of zfs_refcount_remove_many() in abd_free().

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11542
2021-01-28 16:54:12 -08:00
George Amanakis
0ae184a6ba
Avoid updating the L2ARC device header unnecessarily
If we do not write any buffers to the cache device and the evict hand
has not advanced do not update the cache device header.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11522 
Closes #11537
2021-01-28 09:20:03 -08:00
Brian Atkinson
416015ef54
Removing ABD Parent Child Reference Before Freeing ABD
Moving the call to zfs_refcount_remove_many() in abd_free() to be called
before any of the ABD free variants are called. This is necessary
because abd_free_gang() adjusts the abd_size for the gang ABD. If the
parent's child references are removed after free'ing the gang ABD the
refcount is not adjusted correctly for the parent's children.

I also removed some stray abd_put() in comments and changed
abd_free_gang_abd() -> abd_free_gang().

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11539
2021-01-28 09:15:17 -08:00
Mark Maybee
b2c5904a78
Revert special case code from pre-hashtable nvlist era
Before a hash table was added on top of the nvlist code, there were
cases where the nvlist allocation was changed from fnvlist_alloc()
to nvlist_alloc() to avoid expensive NV_UNIQUE_NAME checks. Now
this is no longer necessary. These changes should be reverted to be
consistent with other code. There are some cases where this change
will also reduce the number of iterations.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Maybee <mark.maybee@delphix.com>
Closes #11464
2021-01-27 21:31:51 -08:00
Alan Somers
cf0977ad72 Parallelize vdev_validate
The runtime of vdev_validate is dominated by the disk accesses in
vdev_label_read_config.  Speed it up by validating all vdevs in
parallel using a taskq.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:36:51 -08:00
Alan Somers
67874d5487 Read all disk labels concurrently in vdev_label_read_config
This is similar to what we already do in vdev_geom_read_config.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:36:02 -08:00
Alan Somers
a0e01997ec Parallelize vdev_load
metaslab_init is the slowest part of importing a mature pool, and it
must be repeated hundreds of times for each top-level vdev.  But its
speed is dominated by a few serialized disk accesses.  That can lead to
import times of > 1 hour for pools with many top-level vdevs on spinny
disks.

Speed up the import by using a taskqueue to parallelize vdev_load across
all top-level vdevs.

This also requires adding mutex protection to
metaslab_class_t.mc_historgram.  The mc_histogram fields were
unprotected when that code was first written in "Illumos 4976-4984 -
metaslab improvements" (OpenZFS
f3a7f6610f).  The lock wasn't added until
3dfb57a35e, though it's unclear exactly
which fields it's supposed to protect.  In any case, it wasn't until
vdev_load was parallelized that any code attempted concurrent access to
those fields.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:35:59 -08:00
Matthew Ahrens
62d4287f27
RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks
When scrubbing, (non-sequential) resilvering, or correcting a checksum
error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by
overwriting it.  For example, if P disks are silently corrupted (P being
the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub`
should detect and heal all the bad state on these disks, including
parity.  This way if there is a subsequent failure we are fully
protected.

With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity
sector, and also damage (silent or known) to a data sector.  In this
case the parity should be healed but it is not.

The problem can be noticed by scrubbing the pool twice.  Assuming there
was no damage concurrent with the scrubs, the first scrub should fix all
silent damage, and the second scrub should be "clean" (`zpool status`
should not report checksum errors on any disks).  If the bug is
encountered, then the second scrub will repair the silently-damaged
parity that the first scrub failed to repair, and these checksum errors
will be reported after the second scrub.  Since the first scrub repaired
all the damaged data, the bug can not be encountered during the second
scrub, so subsequent scrubs (more than two) are not necessary.

The root cause of the problem is some code that was inadvertently added
to `raidz_parity_verify()` by the DRAID changes.  The incorrect code
causes the parity healing to be aborted if there is damaged data
(`rc_error != 0`) or the data disk is not present (`!rc_tried`).  These
checks are not necessary, because we only call `raidz_parity_verify()`
if we have the correct data (which may have been reconstructed using
parity, and which was verified by the checksum).

This commit fixes the problem by removing the incorrect checks in
`raidz_parity_verify()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11489 
Closes #11510
2021-01-26 16:05:05 -08:00
Will Andrews
f4f50a7048
spa_export_common: refactor common exit points
Create a common exit point for spa_export_common (a very long 
function), which avoids missing steps on failure.  This work
is helpful for the planned forced pool export changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Will Andrews <will@firepipe.net>
Closes #11514
2021-01-25 15:04:11 -08:00
Colm
4a90d4d6fc
Fix two minor lint errors (cppcheck)
Fix two minor errors reported by cppcheck:

In module/zfs/abd.c (abd_get_offset_impl), add non-NULL
assertion to prevent NULL dereference warning.

In module/zfs/arc.c (l2arc_write_buffers), change 'try'
variable to 'pass' to avoid C++ reserved word.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #11507
2021-01-23 15:49:32 -08:00
Alexander Motin
5aa69a57da
Relax special_small_blocks assertion.
Follow up for commit 624222a, value asserted <= SPA_OLD_MAXBLOCKSIZE
instead of SPA_MAXBLOCKSIZE as it should be after the previous change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11501
2021-01-23 15:45:27 -08:00
Matthew Ahrens
aa755b3549
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:

1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.

2. When under moderate load, ZIL allocations are of 128KB.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.

On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-21 15:12:54 -08:00
Brian Atkinson
d0cd9a5cc6
Extending FreeBSD UIO Struct
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.

Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11438
2021-01-20 21:27:30 -08:00
Matthew Ahrens
e2af2acce3
allow callers to allocate and provide the abd_t struct
The `abd_get_offset_*()` routines create an abd_t that references
another abd_t, and doesn't allocate any pages/buffers of its own.  In
some workloads, these routines may be called frequently, to create many
abd_t's representing small pieces of a single large abd_t.  In
particular, the upcoming RAIDZ Expansion project makes heavy use of
these routines.

This commit adds the ability for the caller to allocate and provide the
abd_t struct to a variant of `abd_get_offset_*()`.  This eliminates the
cost of allocating the abd_t and performing the accounting associated
with it (`abdstat_struct_size`).  The RAIDZ/DRAID code uses this for
the `rc_abd`, which references the zio's abd.  The upcoming RAIDZ
Expansion project will leverage this infrastructure to increase
performance of reads post-expansion by around 50%.

Additionally, some of the interfaces around creating and destroying
abd_t's are cleaned up.  Most significantly, the distinction between
`abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
now disposed of with `abd_free()`.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #8853 
Closes #11439
2021-01-20 11:24:37 -08:00
Matthew Ahrens
2ac90457f5
record ioctl elapsed time in zpool history
Each zfs ioctl that changes on-disk state (e.g. set property, create
snapshot, destroy filesystem) is recorded in the zpool history, and is
printed by `zpool history -i`.

For performance diagnostic purposes, it would be useful to know how long
each of these ioctls took to run.  This commit adds that functionality,
with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist.

Additionally, the time recorded in this history log is currently the
time that the history record is written to disk.  But in many cases (CLI
args logging and ioctl logging), this happens asynchronously,
potentially many seconds after the operation completed.  This commit
changes the timestamp to reflect when the history event was created,
rather than when it was written to disk.

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11440
2021-01-11 09:29:25 -08:00
Matthew Ahrens
dc303dcf5b
assertion failed in arc_wait_for_eviction()
If the system is very low on memory (specifically,
`arc_free_memory() < arc_sys_free/2`, i.e. less than 1/16th of RAM
free), `arc_evict_state_impl()` will defer wakups.  In this case, the
arc_evict_waiter_t's remain on the list, even though `arc_evict_count`
has been incremented past their `aew_count`.

The problem is that `arc_wait_for_eviction()` assumes that if there are
waiters on the list, the count they are waiting for has not yet been
reached.  However, the deferred wakeups may violate this, causing
`ASSERT(last->aew_count > arc_evict_count)` to fail.

This commit resolves the issue by having new waiters use the greater of
`arc_evict_count` and the last `aew_count`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11285
Closes #11397
2021-01-07 20:06:32 -08:00
Toomas Soome
40ab927ae8
implicit conversion from 'boolean_t' to 'ds_hold_flags_t'
Build error on illumos with gcc 10 did reveal:

In function 'dmu_objset_refresh_ownership':
../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
[-Werror=enum-conversion]
      857 |  dsl_dataset_disown(ds, decrypt, tag);
          |                         ^~~~~~~
cc1: all warnings being treated as errors

libzfs_input_check.c: In function 'zfs_ioc_input_tests':
libzfs_input_check.c:754:28: error: implicit conversion from
'enum dmu_objset_type' to 'enum lzc_dataset_type'
[-Werror=enum-conversion]
  754 |  err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
      |                            ^~~~~~~~~~~
cc1: all warnings being treated as errors

The same issue is present in openzfs, and also the same issue about
ds_hold_flags_t, which currently defines exactly one valid value.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #11406
2020-12-27 16:31:02 -08:00
Brian Behlendorf
0c763f76b1
Remove unused check from dmu_tx_count_write()
Individual transactions may not be larger than DMU_MAX_ACCESS.
This is enforced by the assertions in dmu_tx_hold_write() and
dmu_tx_hold_write_by_dnode().  There's an additional check in
dmu_tx_count_write() however it has no effect and only sets a
local err variable.  We could enable this check, however since
it's already enforced by ASSERTs elsewhere I opted to remove it
instead.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3731 
Closes #11384
2020-12-21 20:17:13 -08:00
Andy Fiddaman
39372fa25b
Dangling reference from dmu_objset_upgrade
After porting the fix for https://github.com/openzfs/zfs/issues/5295
over to illumos, we started hitting an assertion failure when running
the testsuite:

	assertion failed: rc->rc_count == number, file: .../refcount.c

and the unexpected hold has this stack:

	dsl_dataset_long_hold+0x59 dmu_objset_upgrade+0x73
dmu_objset_id_quota_upgrade+0x15 dmu_objset_own+0x14f

The simplest reproducer for this in illumos is

    zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool

which is run as part of the zpool_create_tempname test, but I can't get
this to trigger on FreeBSD. This appears to be because of the call to
txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in
illumos), slows down dmu_objset_disown() enough to avoid the condition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andy Fiddaman <andy@omnios.org>
Closes #11368
2020-12-21 10:13:23 -08:00
Christian Schwarz
49c482fde3
dsl_pool: extend comment on DSL Pool Configuration Lock
Based on a conversation with Matt on the OpenZFS Slack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11370
2020-12-19 18:04:05 -08:00
Brian Behlendorf
1c2358c12a
Linux 5.10 compat: use iov_iter in uio structure
As of the 5.10 kernel the generic splice compatibility code has been
removed.  All filesystems are now responsible for registering a
->splice_read and ->splice_write callback to support this operation.

The good news is the VFS provided generic_file_splice_read() and
iter_file_splice_write() callbacks can be used provided the ->iter_read
and ->iter_write callback support pipes.  However, this is currently
not the case and only iovecs and bvecs (not pipes) are ever attached
to the uio structure.

This commit changes that by allowing full iov_iter structures to be
attached to uios.  Ever since the 4.9 kernel the iov_iter structure
has supported iovecs, kvecs, bvevs, and pipes so it's desirable to
pass the entire thing when possible.  In conjunction with this the
uio helper functions (i.e uiomove(), uiocopy(), etc) have been
updated to understand the new UIO_ITER type.

Note that using the kernel provided uio_iter interfaces allowed the
existing Linux specific uio handling code to be simplified.  When
there's no longer a need to support kernel's older than 4.9, then
it will be possible to remove the iovec and bvec members from the
uio structure and always use a uio_iter.  Until then we need to
maintain all of the existing types for older kernels.

Some additional refactoring and cleanup was included in this change:

- Added checks to configure to detect available iov_iter interfaces.
  Some are available all the way back to the 3.10 kernel and are used
  when available.  In particular, uio_prefaultpages() now always uses
  iov_iter_fault_in_readable() which is available for all supported
  kernels.

- The unused UIO_USERISPACE type has been removed.  It is no longer
  needed now that the uio_seg enum is platform specific.

- Moved zfs_uio.c from the zcommon.ko module to the Linux specific
  platform code for the zfs.ko module.  This gets it out of libzfs
  where it was never needed and keeps this Linux specific code out
  of the common sources.

- Removed unnecessary O_APPEND handling from zfs_iter_write(), this
  is redundant and O_APPEND is already handled in zfs_write();

Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11351
2020-12-18 08:48:26 -08:00
Matthew Ahrens
71e4ce0e52
special device removal space accounting fixes
The space in special devices is not included in spa_dspace (or
dsl_pool_adjustedsize(), or the zfs `available` property).  Therefore
there is always at least as much free space in the normal class, as
there is allocated in the special class(es).  And therefore, there is
always enough free space to remove a special device.

However, the checks for free space when removing special devices did not
take this into account.  This commit corrects that.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11329
2020-12-17 12:11:56 -08:00
Ryan Moeller
1531506d23
Avoid extra work updating ARC kstats and tunables
After e357046 it should not be necessary to periodically update ARC
kstats and tunables.  Tunable updates are applied when modified, and
kstats are updated on demand.

Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11237
2020-12-17 11:16:42 -08:00
Matthew Ahrens
be5c6d9653
Only examine best metaslabs on each vdev
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab.  This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be very
impactful.  This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.

To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations.  These will now only examine the 100
best metaslabs (as determined by their ms_weight).  If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.

To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation.  Non-try_hard allocations will only examine the
100 best metaslabs of each vdev.  In summary, we will first try normal
allocation.  If that fails then we will do a gang allocation.  If that
fails then we will do a "try hard" gang allocation.  If that fails then
we will have a multi-layer gang block.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11327
2020-12-16 14:40:05 -08:00
Alexander Motin
f8020c9363
Make metaslab class rotor and aliquot per-allocator.
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks.  Once
multiple allocators were introduced to separate allocation of different
objects it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) same time,
competing for its resources.  This change makes each allocator choose
metaslab group independently, colliding with others only sporadically.

Test including simultaneous write into 4 files with recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores show
reduction of vdev queue lock contention from 54 to 27% due to better
load distribution.  Unfortunately it won't help much ZVOLs yet since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.

While there, to reduce the number of pointer dereferences change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11288
2020-12-15 10:55:44 -08:00
Matthew Macy
923d730329
dmu_zfetch: fix memory leak
The last change caused the read completion callback to not be called
if the IO was still in progress. This change restores allocation
of the arc buf callback, but in the callback path checks the new
acb_nobuf field to know to skip buffer allocation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11324
2020-12-12 16:00:00 -08:00
George Amanakis
c76a40bfda
Fix reporting of CKSUM errors in indirect vdevs
When removing and subsequently reattaching a vdev, CKSUM errors may
occur as vdev_indirect_read_all() reads from all children of a mirror
in case of a resilver.

Fix this by checking whether a child is missing the data and setting a
flag (ic_error) which is then checked in vdev_indirect_repair() and
suppresses incrementing the checksum counter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11277
2020-12-11 12:15:37 -08:00
Matthew Ahrens
ba67d82142
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes.  Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.

This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`.  The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data).  Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU.  On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

The code is also restructured in a few ways:

Added a `dr_dnode` field to the dbuf_dirty_record_t.  This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines.  The new field is needed by the lightweight-type dirty record.

To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t.  To do this we keep a hold on the dnode until it's zio's have
completed.  This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).

`dn_dirty_txg` was previously zeroed when the dnode was synced.  This
was not necessary, since its meaning can be "when was this dnode last
dirtied".  This change simplifies the new `dnode_rele_task()` code.

Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 10:26:02 -08:00
Paul Dagnelie
7d4b365ce3
Fix kernel panic induced by redacted send
In the redaction list traversal code, there is a bug in the binary search
logic when looking for the resume point. Maxbufid can be decremented to -1,
causing us to read the last possible block of the object instead of the one we
wanted. This can cause incorrect resume behavior, or possibly even a hang in
some cases. In addition, when examining non-last blocks, we can treat the
block as being the same size as the last block, causing us to miss entries in
the redaction list when determining where to resume. Finally, we were ignoring
the case where the resume point was found in the buffer being searched, and
resuming from minbufid. All these issues have been corrected, and the code has
been significantly simplified to make future issues less likely.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11297
2020-12-11 10:22:29 -08:00
Paul Dagnelie
60a4c7d2a2
Implement memory and CPU hotplug
ZFS currently doesn't react to hotplugging cpu or memory into the 
system in any way. This patch changes that by adding logic to the ARC 
that allows the system to take advantage of new memory that is added 
for caching purposes. It also adds logic to the taskq infrastructure 
to support dynamically expanding the number of threads allocated to a 
taskq.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11212
2020-12-10 14:09:23 -08:00
Matthew Macy
1e4732cbda
Decouple arc_read_done callback from arc buf instantiation
Add ARC_FLAG_NO_BUF to indicate that a buffer need not be
instantiated.  This fixes a ~20% performance regression on
cached reads due to zfetch changes.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11220 
Closes #11232
2020-12-09 15:05:06 -08:00
Brian Behlendorf
edb20ff3ba
Fix optional "force" arg handing in zfs_ioc_pool_sync()
The fnvlist_lookup_boolean_value() function should not be used
to check the force argument since it's optional.  It may not be
provided or may have been created with the wrong flags.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11281
Closes #11284
2020-12-09 14:52:45 -08:00
Brian Behlendorf
83b698dc42
Reduce fletcher4 and raidz benchmark times
During module load time all of the available fetcher4 and raidz
implementations are benchmarked for a fixed amount of time to
determine the fastest available.  Manual testing has shown that this
time can be significantly reduced with negligible effect on the final
results.

This commit changes the benchmark time to 1ms which can reduce the
module load time by over a second on x86_64.  On an x86_64 system
with sse3, ssse3, and avx2 instructions the benchmark times are:

    Fletcher4    603ms   -> 15ms
    RAIDZ        1,322ms -> 64ms

Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11282
2020-12-06 09:57:20 -08:00
Alexander Motin
8136b9d73b
Avoid some spa_has_pending_synctask() calls.
Since 8c4fb36a24 (PR #7795) spa_has_pending_synctask() started to
take two more locks per write inside txg_all_lists_empty().  I am
surprised those pool-wide locks are not contended, but still their
operations are visible in CPU profiles under contended vdev lock.

This commit slightly changes vdev_queue_max_async_writes() flow to
not call the function if we are going to return max_active any way
due to high amount of dirty data.  It allows to save some CPU time
exactly when the pool is busy.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11280
2020-12-06 09:55:02 -08:00
George Amanakis
d1d47691c2
Fix raw sends on encrypted datasets when copying back snapshots
When sending raw encrypted datasets the user space accounting is present
when it's not expected to be. This leads to the subsequent mount failure
due a checksum error when verifying the local mac.
Fix this by clearing the OBJSET_FLAG_USERACCOUNTING_COMPLETE and reset
the local mac. This allows the user accounting to be correctly updated
on first mount using the normal upgrade process.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10523 
Closes #11221
2020-12-04 14:34:29 -08:00
Alexander Motin
dcf7044522
Fix for "Reduce latency effects of non-interactive I/O"
It was found that setting min_active tunables for non-interactive I/Os
makes them stuck.  It is caused by zfs_vdev_nia_delay, that can never
be reached if we never issue any I/Os due to min_active set to zero.

Fix this by issuing at least one non-interactive I/O at a time when
there are no interactive I/Os.  When there are interactive I/Os, zero
min_active allows to completely block any non-interactive I/O.  It may
min_active starvation in some scenarios, but who we are to deny foot
shooting?

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11261
2020-12-03 10:02:39 -08:00
Ryan Moeller
0aacde2e9a
FreeBSD: notify userspace when a vdev is removed
This is needed for zfsd to autoreplace vdevs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11260
2020-12-02 10:20:02 -08:00
Finix1979
ec50cd24ba
Avoid unneccessary zio allocation and wait
In function dmu_buf_hold_array_by_dnode, the usage of zio is only for 
the reading operation. Only create the zio and wait it in the reading 
scenario as a performance optimization.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Finix Yan <yancw@info2soft.com>
Closes #11251 
Closes #11256
2020-12-02 09:28:55 -08:00
Brian Behlendorf
04a82e043d
Remove incorrect assertion
Commit 85703f6 added a new ASSERT to zfs_write() as part of the
cleanup which isn't correct in the case where multiple processes
are concurrently extending a file.  The `zp->z_size` is updated
atomically while holding a range lock on only a portion of the
file.  Therefore, it's possible for the file size to increase
after a same check is performed earlier in the loop causing this
ASSERT to fail.  The code itself handles this case correctly so
only the invalid ASSERT needs to be removed.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11235
2020-11-24 09:28:42 -08:00
Alexander Motin
6f5aac3ca0
Reduce latency effects of non-interactive I/O
Investigating influence of scrub (especially sequential) on random read
latency I've noticed that on some HDDs single 4KB read may take up to 4
seconds!  Deeper investigation shown that many HDDs heavily prioritize
sequential reads even when those are submitted with queue depth of 1.

This patch addresses the latency from two sides:
 - by using _min_active queue depths for non-interactive requests while
   the interactive request(s) are active and few requests after;
 - by throttling it further if no interactive requests has completed
   while configured amount of non-interactive did.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  It should reduce starvation if several non-interactive
processes are running same time with some interactive and I think should
make possible setting of zfs_vdev_max_active to as low as 1.

I've benchmarked this change with 4KB random reads from ZVOL with 16KB
block size on newly written non-fragmented pool.  On fragmented pool I
also saw improvements, but not so dramatic.  Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
, that means maximum latency reduction from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
, i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
, i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools.  On
idle fragmented pool I've measured scrub getting few percent faster
due to use of QD3 instead of QD2 before.  On idle non-fragmented pool
I've measured no difference.  On busy non-fragmented pool I've measured
scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11166
2020-11-24 09:26:42 -08:00
Brian Behlendorf
4d0ba94113
Correct missing zil_claim() DTL updates
Commit a1d477c2 accidentally disabled DTL updates for the zil_claim()
case described at the end of vdev_stat_update() by unconditionally
disabling all DTL updates when loading.  This was done to avoid
a deadlock on the vd_dtl_lock when loading the DTLs from disk.

    vdev_dtl_contains <--- Takes vd->vd_dtl_lock
    vdev_mirror_child_missing
    vdev_mirror_io_start
    zio_vdev_io_start
    __zio_execute
    arc_read
    dbuf_issue_final_prefetch
    dbuf_prefetch_impl
    dbuf_prefetch
    dmu_prefetch
    space_map_iterate
    space_map_load_length
    space_map_load
    vdev_dtl_load <--- Takes vd->vd_dtl_lock
    vdev_load
    spa_ld_load_vdev_metadata
    spa_tryimport

The missing DTL updates can be restored by moving the space_map_load()
call outside the vd_dtl_lock.  A private range tree is populated by
reading the space map and then merged in to the DTL_MISSING tree
under the lock.

Furthermore, the SPA_LOAD_NONE check in vdev_dtl_contains() leads to an
additional problem.  Any resilvering which occurs before SPA_LOAD_NONE
is set will incorrectly determine that there's nothing to repair.  This
can result in full redundancy not being restored for some blocks.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11218
2020-11-20 13:14:45 -08:00
Ryan Moeller
85703f616d
Reduce confusion in zfs_write
Is this block when abuf != NULL ever reached? Yes, it is.

Add asserts and comments to prove that when we get here, we have a full
block write at an aligned offset extending past EOF.

Simplify by removing the check that tx_bytes == max_blksz, since we can
assert that it is always true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11191
2020-11-18 15:06:59 -08:00
Matthew Macy
0ca45cb310
Fix problems in zvol_set_volmode_impl
- Don't leave fstrans set when passed a snapshot
- Don't remove minor if volmode already matches new value
- (FreeBSD) Wait for GEOM ops to complete before trying
  remove (at create time GEOM will be "tasting" in parallel)
- (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL
- (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11199
2020-11-17 09:50:52 -08:00
loli10K
4072f465bc
Fix 'zfs userspace' for received datasets in encrypted root
For encrypted receives, where user accounting is initially disabled on
creation, both 'zfs userspace' and 'zfs groupspace' fails with
EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to
set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a
successful dmu_objset_space_upgrade().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9501 
Closes #9596
2020-11-16 09:10:29 -08:00
George Amanakis
2c210f6818
Fix ASSERT logic in l2arc_evict()
In case of cache device removal it is possible that at the end of
l2arc_evict() we have l2ad_hand = l2ad_evict. This can lead to the
following panic in case of a debug build:

VERIFY3(dev->l2ad_hand < dev->l2ad_evict) failed (321920512 < 321920512)
Call Trace:
 dump_stack+0x66/0x90
 spl_panic+0xef/0x117 [spl]
 l2arc_remove_vdev+0x11d/0x290 [zfs]
 spa_load_l2cache+0x275/0x5b0 [zfs]
 spa_vdev_remove+0x4a5/0x6e0 [zfs]
 zfs_ioc_vdev_remove+0x59/0xa0 [zfs]
 zfsdev_ioctl_common+0x5b3/0x630 [zfs]
 zfsdev_ioctl+0x53/0xe0 [zfs]
 do_vfs_ioctl+0x42e/0x6b0
 ksys_ioctl+0x5e/0x90
 do_syscall_64+0x5b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

In case of cache device removal it also possible that l2ad_hand +
distance > l2ad_end since we do not iterate l2arc_evict() and l2ad_hand
is not reset. This has no functional consequence however as the cache
device is about to be removed.

Fix this by omitting the ASSERT in case of device removal.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11205
2020-11-16 09:08:11 -08:00
Matthew Ahrens
d66aab7c08
Assertion failure when logging large output of channel program
The output of ZFS channel programs is logged on-disk in the zpool
history, and printed by `zpool history -i`.  Channel programs can use
10MB of memory by default, and up to 100MB by using the `zfs program -m`
flag.  Therefore their output can be up to some fraction of 100MB.

In addition to being somewhat wasteful of the limited space reserved for
the pool history (which for large pools is 1GB), in extreme cases this
can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in
`dmu_buf_hold_array_by_dnode()`.

This commit limits the output size that will be logged to 1MB.  Larger
outputs will not be logged, instead a entry will be logged indicating
the size of the omitted output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11194
2020-11-14 10:17:16 -08:00
Ryan Moeller
7e3617de35
Return EFAULT at the end of zfs_write() when set
FreeBSD's VFS expects EFAULT from zfs_write() if we didn't complete
the full write so it can retry the operation.  Add some missing
SET_ERRORs in zfs_write().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11193
2020-11-14 10:16:26 -08:00
Brian Behlendorf
b2255edcc0
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID.  This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.

A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`.  No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.

    zpool create <pool> draid[1,2,3] <vdevs...>

Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons.  The supported options include:

    zpool create <pool> \
        draid[<parity>][:<data>d][:<children>c][:<spares>s] \
        <vdevs...>

    - draid[parity]       - Parity level (default 1)
    - draid[:<data>d]     - Data devices per group (default 8)
    - draid[:<children>c] - Expected number of child vdevs
    - draid[:<spares>s]   - Distributed hot spares (default 0)

Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.

```
  pool: tank
 state: ONLINE
config:

    NAME                  STATE     READ WRITE CKSUM
    slag7                 ONLINE       0     0     0
      draid2:8d:68c:2s-0  ONLINE       0     0     0
        L0                ONLINE       0     0     0
        L1                ONLINE       0     0     0
        ...
        U25               ONLINE       0     0     0
        U26               ONLINE       0     0     0
        spare-53          ONLINE       0     0     0
          U27             ONLINE       0     0     0
          draid2-0-0      ONLINE       0     0     0
        U28               ONLINE       0     0     0
        U29               ONLINE       0     0     0
        ...
        U42               ONLINE       0     0     0
        U43               ONLINE       0     0     0
    special
      mirror-1            ONLINE       0     0     0
        L5                ONLINE       0     0     0
        U5                ONLINE       0     0     0
      mirror-2            ONLINE       0     0     0
        L6                ONLINE       0     0     0
        U6                ONLINE       0     0     0
    spares
      draid2-0-0          INUSE     currently in use
      draid2-0-1          AVAIL
```

When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command.  These options are leverages
by zloop.sh to test a wide range of dRAID configurations.

    -K draid|raidz|random - kind of RAID to test
    -D <value>            - dRAID data drives per group
    -S <value>            - dRAID distributed hot spares
    -R <value>            - RAID parity (raidz or dRAID)

The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.

Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-13 13:51:51 -08:00
Matthew Ahrens
a724db0374
Channel program may spuriously fail with "memory limit exhausted"
ZFS channel programs (invoked by `zfs program`) are executed in a LUA
sandbox with a limit on the amount of memory they can consume.  The
limit is 10MB by default, and can be raised to 100MB with the `-m` flag.
If the memory limit is exceeded, the LUA program exits and the command
fails with a message like `Channel program execution failed: Memory
limit exhausted.`

The LUA sandbox allocates memory with `vmem_alloc(KM_NOSLEEP)`, which
will fail if the requested memory is not immediately available.  In this
case, the program fails with the same message, `Memory limit exhausted`.
However, in this case the specified memory limit has not been reached,
and the memory may only be temporarily unavailable.

This commit changes the LUA memory allocator `zcp_lua_alloc()` to use
`vmem_alloc(KM_SLEEP)`, so that we won't spuriously fail when memory is
temporarily low.  Instead, we rely on the system to be able to free up
memory (e.g. by evicting from the ARC), and we assume that even at the
highest memory limit of 100MB, the channel program will not truly
exhaust the system's memory.

External-issue: DLPX-71924
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11190
2020-11-11 17:16:15 -08:00
Mateusz Guzik
18ca574f0a
G/C data_alloc_arena
It is a leftover from illumos always set to NULL and introducing a
spurious difference between zio_buf and zio_data_buf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11188
2020-11-11 17:11:32 -08:00
Ryan Moeller
d1dd72a2c5 Simplify offset and length limit in zfs_write
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:58:59 -08:00
Ryan Moeller
9a764716fc Const some unchanging variables in zfs_write
Show that these values will not be changing later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:58:59 -08:00
Ryan Moeller
8a9634e2f3 Remove redundant oid parameter to update_pages
The oid comes from the znode we are already passing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:54:30 -08:00
Ryan Moeller
eec6646ea9 Factor uid, gid, and projid out of loop in zfs_write
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:53:19 -08:00
Alexander Motin
daabddaac1
Fix dmu_tx_dirty_throttle after arc_c reduction
After initial arc_c was reduced to arc_c_min it became possible that
on datasets with primarycache=metadata or none dirty data make up most
of ARC capacity and easily more than configured 50% of initial arc_c,
that causes forced txg commits by arc_tempreserve_space() and periodic
very long write delays.

This patch makes arc_tempreserve_space() to use arc_c only after ARC
warmed up once and arc_c really means something, but use arc_c_max
before that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11178
2020-11-10 10:39:26 -08:00
Matthew Macy
570d7038d0
Fix dnode refcount tracking
Fix a couple of places where the wrong tag is passed
to dnode_{hold, rele}

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11184
2020-11-10 10:37:10 -08:00
Christian Schwarz
ab8c935ea6
zfs_vnops: make zfs_get_data OS-independent
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode 
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async 
itself. It's sibling zvol_get_data already is platform-independent.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #10979
2020-11-02 12:07:07 -08:00
Mateusz Guzik
09eb36ce3d
Introduce CPU_SEQID_UNSTABLE
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.

There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.

In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11142
2020-11-02 11:51:12 -08:00
Matthew Macy
8583540c6e
Consolidate zfs_holey and zfs_access
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11125
2020-10-31 09:40:08 -07:00
Matthew Macy
5fa356ea44
Remove UIO_ZEROCOPY functions structures
The original xuio zero copy functionality has always been unused 
on Linux and FreeBSD.  Remove this disabled code to avoid any
confusion and improve readability.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11124
2020-10-30 10:00:33 -07:00
Alexander Motin
1199c3e8fb
Yield periodically when rebuilding L2ARC
L2ARC devices of several terabytes filled with 4KB blocks may take 15
minutes to rebuild.  Due to the way L2ARC log reading is implemented
it is quite likely that for all that time rebuild thread will never
sleep.  At least on FreeBSD kernel threads have absolute priority and
can not be preempted by threads with lower priorities.  If some thread
is also bound to that specific CPU it may not get any CPU time for all
the 15 minutes.

Reviewed-by: Cedric Berger <cedric@precidata.com>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11116
2020-10-30 08:57:54 -07:00
Ryan Moeller
76d04993a6
Update references to nonexistent man pages in code
Refer to the correct section or alternative for FreeBSD and Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11132
2020-10-30 08:55:59 -07:00
Ryan Moeller
eb02a4c6fb
Add missing zfs_arc_evict_batch_limit tunable
It's even documented already.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11094
2020-10-22 10:18:26 -07:00
Matthew Macy
e53d678d4a
Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD.  With a little refactoring they can be
moved to the common code which is what is done by this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11078
2020-10-21 14:08:06 -07:00
Adam D. Moss
666aa69f32
Non-l2arc pool reads shouldn't be l2arc misses
The current l2_misses accounting behavior treats all reads to pools 
without a configured l2arc as an l2arc miss, IFF there is at least 
one other pool on the system which does have an l2arc configured.

This makes it extremely hard to tune for an improved l2arc hit/miss 
ratio because this ratio will be modulated by reads from pools which 
do not (and should not) have l2arc devices; its upper limit will 
depend on the ratio of reads from l2arc'd pools and non-l2arc'd pools.

This PR prevents ARC reads affecting l2arc stats (n.b. l2_misses is 
the only relevant one) where the target spa doesn't have an l2arc.

Includes new test - l2arc_l2miss_pos.ksh

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #10921
2020-10-20 11:39:52 -07:00
Don Brady
dff71c7936
Ignore special vdev ashift for spa ashift min/max
The removal of a vdev in the normal class would fail if there was a 
special or deup vdev that had a different ashift than the vdevs in 
the normal class.

Moved the initialization of spa_min_ashift / spa_max_ashift from 
vdev_open so that it occurs after the vdev allocation bias was 
initialized (i.e. after vdev_load).

Caveat -- In order to remove a special/dedup vdev it must have the 
same ashift as the normal pool vdevs.  This could perhaps be lifted 
in the future (i.e. for the case where there is ample space in any 
surviving special class vdevs)

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #9363
Closes #9364
Closes #11053
2020-10-15 14:45:16 -07:00
Christian Schwarz
15a4ca4620
Fix crash caused by invalid snapshot names in redactnvl
This is a follow up fix for commit 0fdd6106bb.  The VERIFY is
only true when we haven't hit an error code path.  See added
test case for a reproducer.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11048
2020-10-14 14:04:19 -07:00
Paul Dagnelie
6a60ef80e2
Fix incorrect deletion order in range_tree_add_impl gap case
After a side-effectful call like add or remove, references to range 
segs stored in btrees can no longer be used safely.  We move the 
remove call to just before the reinsertion call so that the seg 
remains valid for as long as we need it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11044 
Closes #11056
2020-10-14 08:59:54 -07:00
Matthew Macy
57dc5d42b1
dmu_zfetch: don't leak unreferenced stream when zfetch is freed
Currently streams are only freed when:
  - They have no referencing zfetch and and their I/O references
    go to zero.
  - They are more than 2s old and a new I/O request comes in on
    the same zfetch.

This means that we will leak unreferenced streams when their zfetch
structure is freed.

This change checks the reference count on a stream at zfetch free
time. If it is zero we free it immediately. If it has remaining
references we allow the prefetch callback to free it at I/O
completion time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11052
2020-10-13 21:03:36 -07:00
Ryan Moeller
7dfc56d866
Expose zfetch_max_idistance tunable
FreeBSD had this value tunable before the switch to the new OpenZFS.
The tunable name has changed, breaking legacy compat.

Restore legacy compat for this tunable, properly expose the tunable
with the new name on all platforms, and document it in
zfs-module-parameters(5).

While here, clean up the documentation for zfetch_max_distance a bit.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11038
2020-10-13 09:32:34 -07:00
Christian Schwarz
61868bb14d
zil_parse: make callback parameters const
Code cleanup, a follow up commit to 4d55ea81.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11020
2020-10-09 09:34:54 -07:00
Brian Behlendorf
d0249a4bd0
Replace ZFS on Linux references with OpenZFS
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux.  Web links have been updated
to refer to https://github.com/openzfs/zfs.  The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11007
2020-10-08 20:10:13 -07:00
Chuck Tuffli
a8fc1b8743
Fix ubsan: shift exponent is too large
When running libzpool with the Undefined Behavior Sanitizer (ubsan)
enabled, a zpool create causes a run-time error:

    module/zfs/vdev_label.c:600:14: runtime error: shift exponent 64 is
    too large for 64-bit type 'long long unsigned int'`

in vdev_config_generate()

Fix is to convert vdev_removal_max_span to its base-2 logarithm, using
highbit64(), and then compare the "shifts".

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chuck Tuffli <ctuffli@gmail.com>
Closes #9744
Closes #11024
2020-10-08 16:37:27 -07:00
George Amanakis
a76e4e6761
Make L2ARC tests more robust
Instead of relying on arbitrary timers after pool export/import or cache
device off/online rely on arcstats. This makes the L2ARC tests more
robust. Also cleanup some functions related to persistent L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10983
2020-10-05 15:29:05 -07:00
Ryan Moeller
4d55ea811d
Throw const on some strings
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.

Add const in a few places where it makes sense.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10997
2020-10-02 17:44:10 -07:00
John Poduska
5b525165e9
Mismatched nvlist names in zfs_keys_send_space
This causes "zfs send -vt ..." to fail with:

    cannot resume send: Unknown error 1030

It turns out that some of the name/value pairs in the verification
list for zfs_ioc_send_space(), zfs_keys_send_space, had the wrong
name, so the ioctl got kicked out in zfs_check_input_nvpairs().
Update the names accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #10978
2020-10-02 17:40:46 -07:00
Matthew Macy
1cb8202b1b
Eliminate gratuitous bzeroing in dbuf_stats_hash_table_data
`dbuf_stats_hash_table_data` can take much longer than it needs to
by repeatedly bzeroing its buffer when in fact the buffer only needs
to be NULL terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10993
2020-09-30 13:24:38 -07:00
Sebastian Gottschall
8a171ccd92
do a cyclic seek for unused memory objects in pool
In non regular use cases allocated memory might stay persistent in memory
pool. This small patch checks every minute if there are old objects which
can be released from memory pool.

Right now with regular use, the pool is checked for old objects on each
allocation attempt from this pool. so basically polling by its use. Now
consider what happens if someone writes a lot of files and stops use of
the volume or even unmounts it. So the code will no longer check if
objects can be released from the pool. Already allocated objects will
still stay in pool cache. this is no big issue for common use. But
someone discovered this issue while doing tests. personally i know this
behavior and I'm aware of it. Its no big issue. just a enhancement

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10938 
Closes #10969
2020-09-30 13:22:34 -07:00
Ryan Moeller
c0bd2e0fe2
Drop references when skipping dmu_send due to EXDEV
When an invalid incremental send is requested where the "to" ds is
before the "from" ds, make sure to drop the reference to the pool
and the dataset before returning the error.

Add an assert on FreeBSD to make sure we don't hold any locks after
returning from an ioctl.

Add some test coverage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10919
2020-09-30 13:19:49 -07:00
Matthew Macy
af20b97078
zfetch: Don't issue new streams when old have not completed
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a readonly workload
(and thus no exponential backoff in operations from the "write
throttle") such as L2ARC rebuild it is possible to saturate the drives
with I/O requests. These are then effectively compounded with prefetch
requests.

This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10900
2020-09-27 17:08:38 -07:00
Adam D. Moss
acfd2d4641
Add DB_RF_NOPREFETCH to dbuf_read()s in dnode.c
Prefetching of dnodes in dbuf_read() can cause significant mutex 
contention for some workloads and isn't very helpful.  This is  
because we already get 32 dnodes for each block read, and when 
iterating over a directory we prefetch the dnodes in the directory.
Disable this prefetching to prevent the lock contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Submitted-by: Adam Moss <c@yotes.com>
Submitted-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #10877 
Closes #10953
2020-09-25 13:49:22 -07:00
Ryan Moeller
863e38453e Prune dead branch reported by Coverity
wkey is NULL at every `goto error;`.
dcp is never NULL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10884
2020-09-25 13:11:53 -07:00
Christian Schwarz
a5c77dc4d5
zfs_log_write: simplify data copying code for WR_COPIED records
lr_write_t records that are WR_COPIED have the record data directly
appended to them (see lr_write_t type definition).

The data is copied from the debuf using dmu_read_by_dnode.

This function was called, only for WR_COPIED records, as part of a
short-circuiting if-statement's if-expression.

I found this side-effectful call to dmu_read_by_dnode pretty
hard to spot.
This patch improves readability by moving the call to its own line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #10956
2020-09-25 13:06:34 -07:00
Matthew Macy
7b8363d7f0
FreeBSD: Add support for procfs_list
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
                           
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10890
2020-09-23 16:43:51 -07:00
Paul Dagnelie
20dfe8cd3b
Don't set numobjs to UINT64_MAX or near it
Resolves an issue with `zfs send` streams from 0.8.4 which
prevents them from being received by versions < 0.7.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10911 
Closes #10916
2020-09-22 16:16:07 -07:00
George Amanakis
c6f5e9d92f
Restore clearing of L2CACHE flag in arc_read_done()
Commit 45152dc removed clearing of L2CACHE flag in arc_read_done() and
moved related code in l2arc_write_eligible(). After careful code
inspection arc_read_done() is not bypassed in the case of prefetches.
Thus restore the old behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: adam moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10951
2020-09-22 16:08:05 -07:00
George Wilson
c494aa7f57
vdev_ashift should only be set once
== Motivation and Context

The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.

In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.

== Description

This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes #10932
2020-09-18 12:13:47 -07:00
Pavel Snajdr
9569c31161 Fix stack frame size: dnode_dirty_l1range()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:55 -07:00
Pavel Snajdr
a1c5578ce0 dmu_redact_snap: fix possible memleak
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:45 -07:00
Pavel Snajdr
8c0b16e6e9 Fix stack frame size: dmu_redact_snap()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:35 -07:00
Pavel Snajdr
c95625769d Fix stack frame size: spa_livelist_delete_cb()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:03 -07:00
Toomas Soome
1db9e6e4e4
zfs label bootenv should store data as nvlist
nvlist does allow us to support different data types and systems.

To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.

Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10774
2020-09-15 15:42:27 -07:00