The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs. Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not. In
other words, these are instances of silent data corruption. The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).
Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.
RAIDZ/DRAID generate checksum errors via three code paths:
1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction. This is handled in `raidz_reconstruct()`. This is the
most common type of RAIDZ/DRAID checksum error.
2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost. The zio fails and an error is
returned to the consumer (e.g. the read(2) system call). This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged. Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read. (For DRAID,
typically only one "group" of children is involved in each io.) This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.
3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happens to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child. This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity. This case
happens very rarely, since normally only metadata has ditto blocks. If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.
The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:
1. It needs to run after the main raidz read logic has completed. The
data RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.
2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data. For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.
Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3. These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11735
The `abd_get_offset_*()` routines create an abd_t that references
another abd_t, and doesn't allocate any pages/buffers of its own. In
some workloads, these routines may be called frequently, to create many
abd_t's representing small pieces of a single large abd_t. In
particular, the upcoming RAIDZ Expansion project makes heavy use of
these routines.
This commit adds the ability for the caller to allocate and provide the
abd_t struct to a variant of `abd_get_offset_*()`. This eliminates the
cost of allocating the abd_t and performing the accounting associated
with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for
the `rc_abd`, which references the zio's abd. The upcoming RAIDZ
Expansion project will leverage this infrastructure to increase
performance of reads post-expansion by around 50%.
Additionally, some of the interfaces around creating and destroying
abd_t's are cleaned up. Most significantly, the distinction between
`abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
now disposed of with `abd_free()`.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #8853Closes#11439
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10102
Provide a common zfs_file_* interface which can be implemented on all
platforms to perform normal file access from either the kernel module
or the libzpool library.
This allows all non-portable vnode_t usage in the common code to be
replaced by the new portable zfs_file_t. The associated vnode and
kobj compatibility functions, types, and macros have been removed
from the SPL. Moving forward, vnodes should only be used in platform
specific code when provided by the native operating system.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9556
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4547Closes#5503Closes#5523Closes#6377Closes#6495
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5428
Porting notes:
- Several direct callers of zk_thread_create() are passing TS_RUN for the
length. The `len` and `state` were inverted,this commit fixes them.
Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru
OpenZFS-issue: https://www.illumos.org/issues/6871
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8fc9228Closes#5621
Enable picky cstyle checks and resolve the new warnings. The vast
majority of the changes needed were to handle minor issues with
whitespace formatting. This patch contains no functional changes.
Non-whitespace changes are as follows:
* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5465
Enable vectorized raidz code on ABD buffers. The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
When timeout is specified (-t), stop worker threads in the middle of work units.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #5180Closes#5190
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4328