.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or http://www.opensolaris.org/os/licensing.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd August 9, 2019
.Dt ZPOOLCONCEPTS 8
.Os
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width Ds
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system of which it is a part.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with N disks of size X can hold X bytes and can withstand (N-1) devices
failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
.Qq write hole
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group.
.Pp
A raidz group can have single-, double-, or triple-parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with N disks of size X with P parity disks can hold approximately
(N-P)*X bytes and can withstand P device(s) failing without losing data.
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares which
allows for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with D
data devices and P parity devices.
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default D=8 and 4k disk sectors, the minimum allocation
size is 32k.
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID, the default volblocksize property is
increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
In regards to IOPS, performance is similar to raidz since for any read all D
data disks must be accessed.
Delivered random IOPS can be reasonably approximated as
floor((N-S)/(D+P))*<single-drive-IOPS>.
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with N disks of size X, D data disks per redundancy group, P parity
level, and S distributed hot spares can hold approximately (N-S)*(D/(D+P))*X
bytes and can withstand P device(s) failing without losing data.
.It Sy draid[<parity>][:<data>d][:<children>c][:<spares>s]
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword.
.Pp
.Em parity
- The parity level (1-3).
.Pp
.Em data
- The number of data devices per redundancy group.
In general, a smaller value of D will increase IOPS, improve the compression
ratio, and speed up resilvering at the expense of total usable capacity.
Defaults to 8, unless N-P-S is less than 8.
.Pp
.Em children
- The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.Pp
.Em spares
- The number of distributed hot spares.
Defaults to zero.
.Pp
A creation example using this syntax is shown below, following the list of
vdev types.
.Pp
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool. If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool. If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line, separated by
whitespace.
The keywords
.Sy mirror
and
.Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates two root vdevs, each a mirror of two disks:
.Bd -literal
# zpool create mypool mirror sda sdb mirror sdc sdd
.Ed
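.Pp
Similarly (the pool and device names below are only illustrative), a
double-parity raidz group, or a dRAID vdev using the layout syntax described
above, could be created with:
.Bd -literal
# zpool create tank raidz2 sda sdb sdc sdd sde sdf
# zpool create tank draid2:2d:9c:1s sda sdb sdc sdd sde sdf sdg sdh sdi
.Ed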
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states: online, degraded,
or faulted.
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of a top-level vdev, such as a mirror or raidz device, is
potentially impacted by the state of its associated vdevs, or component
devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
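.Pp
As a quick illustration (pool and device names are placeholders), a device can
be administratively taken offline and later brought back online with:
.Bd -literal
# zpool offline tank sda
# zpool online tank sda
.Ed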
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from RAIDZ parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block cannot be reconstructed (e.g. due to 3 disks being damaged
in a RAIDZ2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
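.Pp
For example (pool name illustrative), the per-device error counters and recent
error events can be inspected with:
.Bd -literal
# zpool status -v tank
# zpool events tank
.Ed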
.Pp
If a device is removed and later re-attached to the system, ZFS attempts
to put the device online automatically.
Device attach detection is hardware-dependent and might not be supported on all
platforms.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Bd -literal
# zpool create pool mirror sda sdb spare sdc sdd
.Ed
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
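For example (device name illustrative), a spare can be added to and later
removed from an existing pool with:
.Bd -literal
# zpool add pool spare sde
# zpool remove pool sde
.Ed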
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported since other pools may use this shared spare, which may lead to
data corruption.
.Pp
Shared spares add some risk.
If pools sharing a spare are imported on different hosts and both suffer a
device failure at the same time, both may attempt to use the same spare at the
same time.
This conflict may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
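For example (device name illustrative), detaching a spare that is currently
replacing a failed device cancels the replacement:
.Bd -literal
# zpool detach pool sdc
.Ed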
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of (
.Qq draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID
) and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Bd -literal
# zpool create pool sda sdb log sdc
.Ed
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached and removed. In
addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
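.Pp
As a sketch (device and vdev names are illustrative), a mirrored log device
can be added to an existing pool and later removed by naming its top-level
mirror vdev:
.Bd -literal
# zpool add pool log mirror sdc sdd
# zpool remove pool mirror-1
.Ed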
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Bd -literal
# zpool create pool sda sdb cache sdc sdd
.Ed
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled = 0 .
For cache devices smaller than 1GB, ZFS does not write the metadata structures
required for rebuilding the L2ARC, in order not to waste space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header (512 bytes) is updated even if no metadata structures
are written.
Setting
.Sy l2arc_headroom = 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent L2ARC).
If a cache device is added with
.Nm zpool Cm add
its label and header will be overwritten and its contents are not going to be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and online the cache device when there is less memory
pressure, in order to fully restore its contents to L2ARC.
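.Pp
As a sketch (Linux shown; ZFS module parameters are exposed under
.Pa /sys/module/zfs/parameters
on that platform), persistent L2ARC rebuild could be disabled at runtime with:
.Bd -literal
# echo 0 > /sys/module/zfs/parameters/l2arc_rebuild_enabled
.Ed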
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions (e.g.
.Nm zfs Cm destroy
), an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, while a pool has a checkpoint, certain operations are not allowed.
Specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's guid.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Bd -literal
# zpool checkpoint pool
.Ed
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Bd -literal
# zpool export pool
# zpool import --rewind-to-checkpoint pool
.Ed
.Pp
To discard the checkpoint from a pool:
.Bd -literal
# zpool checkpoint -d pool
.Ed
.Pp
Dataset reservations (controlled by the
.Nm reservation
or
.Nm refreservation
zfs properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.Ss Special Allocation Class
The allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal (non-dedup/special) vdev before
other devices can be assigned to the special class.
If the special class becomes full, then allocations intended for it will spill
back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by setting the
.Sy zfs_ddt_data_is_special
zfs module parameter to false (0).
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed in the special
class by setting the
.Sy special_small_blocks
dataset property.
It defaults to zero, so you must opt in by setting it to a non-zero value.
See
.Xr zfs 8
for more info on setting this property.
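.Pp
For example (device names, dataset name, and the block size threshold are
illustrative), a pool could be created with a mirrored special vdev, and an
existing dataset could then opt in to storing small file blocks on it:
.Bd -literal
# zpool create pool raidz sda sdb sdc special mirror sdd sde
# zfs set special_small_blocks=32K pool/fs
.Ed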