Distributed Spare (dRAID) Feature

This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID.  This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to a pool with a failed device.

A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`.  No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.

    zpool create <pool> draid[1,2,3] <vdevs...>

Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons.  The supported options include:

    zpool create <pool> \
        draid[<parity>][:<data>d][:<children>c][:<spares>s] \
        <vdevs...>

    - draid[<parity>]     - Parity level (default 1)
    - draid[:<data>d]     - Data devices per group (default 8)
    - draid[:<children>c] - Expected number of child vdevs
    - draid[:<spares>s]   - Distributed hot spares (default 0)
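
The option syntax above can be illustrated with a small parser. This is
only a sketch: `parse_draid_spec` is a hypothetical helper, not part of
OpenZFS, and it assumes that the data default falls back to N-P-S when
that is less than 8 and that the colon separated options may appear in
any order.

```python
import re

# Hypothetical helper (not an OpenZFS API): parses the documented
# draid[<parity>][:<data>d][:<children>c][:<spares>s] form.
_SPEC = re.compile(r"^draid([1-3])?((?::\d+[dcs])*)$")


def parse_draid_spec(spec, nchildren):
    """Return the parity/data/children/spares implied by a draid spec.

    nchildren is the number of child vdevs actually listed on the
    command line; it is used for defaults and for the 'c' cross-check.
    """
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"not a draid spec: {spec!r}")

    parity = int(m.group(1) or 1)          # default parity is 1
    opts = {}
    for value, suffix in re.findall(r":(\d+)([dcs])", m.group(2)):
        if suffix in opts:
            raise ValueError(f"option {suffix!r} given twice")
        opts[suffix] = int(value)

    spares = opts.get("s", 0)              # default spares is 0
    children = opts.get("c", nchildren)
    if children != nchildren:
        # Mirrors the documented cross-check on the child count.
        raise ValueError(f"expected {children} children, got {nchildren}")

    # Assumption: default is 8 data devices, reduced to N-P-S if smaller.
    data = opts.get("d", min(8, nchildren - parity - spares))
    return {"parity": parity, "data": data,
            "children": children, "spares": spares}
```

For instance, `parse_draid_spec("draid2:8d:68c:2s", 68)` describes the
layout used in the `zpool status` example in this message.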

Abbreviated example `zpool status` output for a 68-disk dRAID pool
with two distributed spares using special allocation classes:

```
  pool: tank
 state: ONLINE
config:

    NAME                  STATE     READ WRITE CKSUM
    tank                  ONLINE       0     0     0
      draid2:8d:68c:2s-0  ONLINE       0     0     0
        L0                ONLINE       0     0     0
        L1                ONLINE       0     0     0
        ...
        U25               ONLINE       0     0     0
        U26               ONLINE       0     0     0
        spare-53          ONLINE       0     0     0
          U27             ONLINE       0     0     0
          draid2-0-0      ONLINE       0     0     0
        U28               ONLINE       0     0     0
        U29               ONLINE       0     0     0
        ...
        U42               ONLINE       0     0     0
        U43               ONLINE       0     0     0
    special
      mirror-1            ONLINE       0     0     0
        L5                ONLINE       0     0     0
        U5                ONLINE       0     0     0
      mirror-2            ONLINE       0     0     0
        L6                ONLINE       0     0     0
        U6                ONLINE       0     0     0
    spares
      draid2-0-0          INUSE     currently in use
      draid2-0-1          AVAIL
```
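
The distributed spare names in this output (`draid2-0-0`, `draid2-0-1`)
follow the naming described in the zpoolconcepts change below, where
`draid1-2-3` is spare 3 of vdev 2, a single parity dRAID. A minimal
sketch of splitting such a name into its parts, using a hypothetical
helper `parse_spare_name` that is not part of OpenZFS:

```python
import re


def parse_spare_name(name):
    """Split a distributed spare name into (parity, vdev, spare).

    Hypothetical helper: assumes the draid<parity>-<vdev>-<spare>
    pattern described in the man page text.
    """
    m = re.fullmatch(r"draid([1-3])-(\d+)-(\d+)", name)
    if m is None:
        raise ValueError(f"not a distributed spare name: {name!r}")
    parity, vdev, spare = (int(g) for g in m.groups())
    return parity, vdev, spare
```

Here `parse_spare_name("draid2-0-0")` yields parity 2, vdev 0, spare 0,
which matches the in-use spare shown under `spare-53`.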

When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command.  These options are leveraged
by zloop.sh to test a wide range of dRAID configurations.

    -K draid|raidz|random - kind of RAID to test
    -D <value>            - dRAID data drives per group
    -S <value>            - dRAID distributed hot spares
    -R <value>            - RAID parity (raidz or dRAID)

The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated to provide test coverage for the
dRAID feature.

Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
Brian Behlendorf
2020-11-13 13:51:51 -08:00
committed by GitHub
parent a724db0374
commit b2255edcc0
153 changed files with 10203 additions and 1882 deletions
+9
@@ -61,6 +61,11 @@ during testing.
.IP
Size of data for raidz block. Size is 1 << (zio_size_shift).
.HP
.BI "\-r" " reflow_offset" " (default: uint max)"
.IP
Set raidz expansion offset. The expanded raidz map allocation function will
produce different map configurations depending on this value.
.HP
.BI "\-S(weep)"
.IP
Sweep parameter space while verifying the raidz implementations. This option
@@ -77,6 +82,10 @@ This options starts the benchmark mode. All implementations are benchmarked
using increasing per disk data size. Results are given as throughput per disk,
measured in MiB/s.
.HP
.BI "\-e(xpansion)"
.IP
Use expanded raidz map allocation function.
.HP
.BI "\-v(erbose)"
.IP
Increase verbosity.
+20 -3
@@ -23,6 +23,7 @@
.\" Copyright (c) 2009 Oracle and/or its affiliates. All rights reserved.
.\" Copyright (c) 2009 Michael Gebetsroither <michael.geb@gmx.at>. All rights
.\" reserved.
.\" Copyright (c) 2017, Intel Corporation.
.\"
.TH ZTEST 1 "Aug 24, 2020" OpenZFS
@@ -82,13 +83,29 @@ Used alignment in test.
.IP
Number of mirror copies.
.HP
.BI "\-r" " raidz_disks" " (default: 4)"
.BI "\-r" " raidz_disks / draid_disks" " (default: 4 / 16)"
.IP
Number of raidz disks.
.HP
.BI "\-R" " raidz_parity" " (default: 1)"
.BI "\-R" " raid_parity" " (default: 1)"
.IP
Raidz parity.
Raid parity (raidz & draid).
.HP
.BI "\-K" " raid_kind" " (default: 'random') raidz|draid|random"
.IP
The kind of RAID config to use. With 'random' the kind alternates between raidz and draid.
.HP
.BI "\-D" " draid_data" " (default: 4)"
.IP
Number of data disks in a dRAID redundancy group.
.HP
.BI "\-S" " draid_spares" " (default: 1)"
.IP
Number of dRAID distributed spare disks.
.HP
.BI "\-C" " vdev_class_state" " (default: random)"
.IP
The vdev allocation class state: special=on|off|random.
.HP
.BI "\-d" " datasets" " (default: 7)"
.IP
+25
@@ -2902,6 +2902,31 @@ top-level vdev.
Default value: \fB1,048,576\fR.
.RE
.sp
.ne 2
.na
\fBzfs_rebuild_scrub_enabled\fR (int)
.ad
.RS 12n
Automatically start a pool scrub when the last active sequential resilver
completes in order to verify the checksums of all blocks which have been
resilvered. This option is enabled by default and is strongly recommended.
.sp
Default value: \fB1\fR.
.RE
.sp
.ne 2
.na
\fBzfs_rebuild_vdev_limit\fR (ulong)
.ad
.RS 12n
Maximum amount of i/o that can be concurrently issued for a sequential
resilver per leaf device, given in bytes.
.sp
Default value: \fB33,554,432\fR.
.RE
.sp
.ne 2
.na
+24
@@ -306,6 +306,30 @@ This feature becomes \fBactive\fR when the \fBzpool remove\fR subcommand is used
on a top-level vdev, and will never return to being \fBenabled\fR.
.RE
.sp
.ne 2
.na
\fBdraid\fR
.ad
.RS 4n
.TS
l l .
GUID org.openzfs:draid
READ\-ONLY COMPATIBLE no
DEPENDENCIES none
.TE
This feature enables use of the \fBdraid\fR vdev type. dRAID is a variant
of raidz which provides integrated distributed hot spares that allow faster
resilvering while retaining the benefits of raidz. Data, parity, and spare
space are organized in redundancy groups and distributed evenly over all of
the devices.
This feature becomes \fBactive\fR when creating a pool which uses the
\fBdraid\fR vdev type, or when adding a new \fBdraid\fR vdev to an
existing pool.
.RE
.sp
.ne 2
.na
+2
@@ -73,12 +73,14 @@ and period
The pool names
.Sy mirror ,
.Sy raidz ,
.Sy draid ,
.Sy spare
and
.Sy log
are reserved, as are names beginning with
.Sy mirror ,
.Sy raidz ,
.Sy draid ,
.Sy spare ,
and the pattern
.Sy c[0-9] .
+1 -1
@@ -52,7 +52,7 @@ Begins a scrub or resumes a paused scrub.
The scrub examines all data in the specified pools to verify that it checksums
correctly.
For replicated
.Pq mirror or raidz
.Pq mirror, raidz, or draid
devices, ZFS automatically repairs any damage discovered during the scrub.
The
.Nm zpool Cm status
+75 -3
@@ -64,7 +64,7 @@ A file must be specified by a full path.
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with N disks of size X can hold X bytes and can withstand (N-1) devices
failing before data integrity is compromised.
failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
@@ -88,11 +88,75 @@ vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with N disks of size X with P parity disks can hold approximately
(N-P)*X bytes and can withstand P device(s) failing before data integrity is
compromised.
(N-P)*X bytes and can withstand P device(s) failing without losing data.
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares which
allows for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with D
data devices and P parity devices.
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default D=8 and 4k disk sectors the minimum allocation
size is 32k.
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID the default volblocksize property is increased
to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to IOPS, performance is similar to raidz since for any read all D
data disks must be accessed.
Delivered random IOPS can be reasonably approximated as
floor((N-S)/(D+P))*<single-drive-IOPS>.
.Pp
Like raidz a dRAID can have single-, double-, or triple-parity. The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with N disks of size X, D data disks per redundancy group, P parity
level, and S distributed hot spares can hold approximately (N-S)*(D/(D+P))*X
bytes and can withstand P device(s) failing without losing data.
.It Sy draid[<parity>][:<data>d][:<children>c][:<spares>s]
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword.
.Pp
.Em parity
- The parity level (1-3).
.Pp
.Em data
- The number of data devices per redundancy group.
In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity.
Defaults to 8, unless N-P-S is less than 8.
.Pp
.Em children
- The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.Pp
.Em spares
- The number of distributed hot spares.
Defaults to zero.
.Pp
.Pp
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
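
The approximations given in the draid text of this hunk can be worked
through numerically. The sketch below uses a hypothetical function
`draid_estimates` (not an OpenZFS API). The capacity and IOPS formulas
are taken from the man page text; the minimum allocation size is assumed
to be D times the sector size, generalized from the single D=8, 4k
example given there.

```python
def draid_estimates(n, d, p, s, disk_bytes, drive_iops, sector_bytes=4096):
    """Rough dRAID sizing figures; all results are approximations.

    n: child disks, d: data disks per group, p: parity level,
    s: distributed spares, disk_bytes: size X of one disk,
    drive_iops: random IOPS of a single drive (caller-supplied).
    """
    # Usable capacity: (N-S) * (D / (D+P)) * X
    capacity = (n - s) * d * disk_bytes // (d + p)
    # Delivered random IOPS: floor((N-S) / (D+P)) * single-drive IOPS
    iops = ((n - s) // (d + p)) * drive_iops
    # Assumption: minimum allocation is one sector on each data disk.
    min_alloc = d * sector_bytes
    return capacity, iops, min_alloc


# Example: 68 children, draid2:8d:2s, 1 TB disks, and a hypothetical
# 100 random IOPS per drive.
capacity, iops, min_alloc = draid_estimates(68, 8, 2, 2, 10**12, 100)
print(capacity, iops, min_alloc)
```

With these inputs the usable capacity is about 52.8 TB, or 80% of the 66
non-spare disks, and the minimum allocation is 32k as stated in the text.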
@@ -273,6 +337,14 @@ If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of (
.Qq draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID
) and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous