mirror of
https://git.proxmox.com/git/mirror_zfs.git
synced 2025-01-27 18:34:22 +03:00
b2255edcc0
This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leverages by zloop.sh to test a wide range of dRAID configurations. -K draid|raidz|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <he.huang@intel.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Co-authored-by: Don Brady <don.brady@delphix.com> Co-authored-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10102
487 lines
18 KiB
Groff
487 lines
18 KiB
Groff
.\"
|
|
.\" CDDL HEADER START
|
|
.\"
|
|
.\" The contents of this file are subject to the terms of the
|
|
.\" Common Development and Distribution License (the "License").
|
|
.\" You may not use this file except in compliance with the License.
|
|
.\"
|
|
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
|
.\" or http://www.opensolaris.org/os/licensing.
|
|
.\" See the License for the specific language governing permissions
|
|
.\" and limitations under the License.
|
|
.\"
|
|
.\" When distributing Covered Code, include this CDDL HEADER in each
|
|
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
.\" If applicable, add the following below this CDDL HEADER, with the
|
|
.\" fields enclosed by brackets "[]" replaced with your own identifying
|
|
.\" information: Portions Copyright [yyyy] [name of copyright owner]
|
|
.\"
|
|
.\" CDDL HEADER END
|
|
.\"
|
|
.\"
|
|
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
|
|
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
|
|
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
|
|
.\" Copyright (c) 2017 Datto Inc.
|
|
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
|
|
.\" Copyright 2017 Nexenta Systems, Inc.
|
|
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
|
|
.\"
|
|
.Dd August 9, 2019
|
|
.Dt ZPOOLCONCEPTS 8
|
|
.Os
|
|
.Sh NAME
|
|
.Nm zpoolconcepts
|
|
.Nd overview of ZFS storage pools
|
|
.Sh DESCRIPTION
|
|
.Ss Virtual Devices (vdevs)
|
|
A "virtual device" describes a single device or a collection of devices
|
|
organized according to certain performance and fault characteristics.
|
|
The following virtual devices are supported:
|
|
.Bl -tag -width Ds
|
|
.It Sy disk
|
|
A block device, typically located under
|
|
.Pa /dev .
|
|
ZFS can use individual slices or partitions, though the recommended mode of
|
|
operation is to use whole disks.
|
|
A disk can be specified by a full path, or it can be a shorthand name
|
|
.Po the relative portion of the path under
|
|
.Pa /dev
|
|
.Pc .
|
|
A whole disk can be specified by omitting the slice or partition designation.
|
|
For example,
|
|
.Pa sda
|
|
is equivalent to
|
|
.Pa /dev/sda .
|
|
When given a whole disk, ZFS automatically labels the disk, if necessary.
|
|
.It Sy file
|
|
A regular file.
|
|
The use of files as a backing store is strongly discouraged.
|
|
It is designed primarily for experimental purposes, as the fault tolerance of a
|
|
file is only as good as the file system of which it is a part.
|
|
A file must be specified by a full path.
|
|
.It Sy mirror
|
|
A mirror of two or more devices.
|
|
Data is replicated in an identical fashion across all components of a mirror.
|
|
A mirror with N disks of size X can hold X bytes and can withstand (N-1) devices
|
|
failing without losing data.
|
|
.It Sy raidz , raidz1 , raidz2 , raidz3
|
|
A variation on RAID-5 that allows for better distribution of parity and
|
|
eliminates the RAID-5
|
|
.Qq write hole
|
|
.Pq in which data and parity become inconsistent after a power loss .
|
|
Data and parity is striped across all disks within a raidz group.
|
|
.Pp
|
|
A raidz group can have single-, double-, or triple-parity, meaning that the
|
|
raidz group can sustain one, two, or three failures, respectively, without
|
|
losing any data.
|
|
The
|
|
.Sy raidz1
|
|
vdev type specifies a single-parity raidz group; the
|
|
.Sy raidz2
|
|
vdev type specifies a double-parity raidz group; and the
|
|
.Sy raidz3
|
|
vdev type specifies a triple-parity raidz group.
|
|
The
|
|
.Sy raidz
|
|
vdev type is an alias for
|
|
.Sy raidz1 .
|
|
.Pp
|
|
A raidz group with N disks of size X with P parity disks can hold approximately
|
|
(N-P)*X bytes and can withstand P device(s) failing without losing data.
|
|
The minimum number of devices in a raidz group is one more than the number of
|
|
parity disks.
|
|
The recommended number is between 3 and 9 to help increase performance.
|
|
.It Sy draid , draid1 , draid2 , draid3
|
|
A variant of raidz that provides integrated distributed hot spares which
|
|
allows for faster resilvering while retaining the benefits of raidz.
|
|
A dRAID vdev is constructed from multiple internal raidz groups, each with D
|
|
data devices and P parity devices.
|
|
These groups are distributed over all of the children in order to fully
|
|
utilize the available disk performance.
|
|
.Pp
|
|
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
|
|
zeros) to allow fully sequential resilvering.
|
|
This fixed stripe width significantly effects both usable capacity and IOPS.
|
|
For example, with the default D=8 and 4k disk sectors the minimum allocation
|
|
size is 32k.
|
|
If using compression, this relatively large allocation size can reduce the
|
|
effective compression ratio.
|
|
When using ZFS volumes and dRAID the default volblocksize property is increased
|
|
to account for the allocation size.
|
|
If a dRAID pool will hold a significant amount of small blocks, it is
|
|
recommended to also add a mirrored
|
|
.Sy special
|
|
vdev to store those blocks.
|
|
.Pp
|
|
In regards to IO/s, performance is similar to raidz since for any read all D
|
|
data disks must be accessed.
|
|
Delivered random IOPS can be reasonably approximated as
|
|
floor((N-S)/(D+P))*<single-drive-IOPS>.
|
|
.Pp
|
|
Like raidz a dRAID can have single-, double-, or triple-parity. The
|
|
.Sy draid1 ,
|
|
.Sy draid2 ,
|
|
and
|
|
.Sy draid3
|
|
types can be used to specify the parity level.
|
|
The
|
|
.Sy draid
|
|
vdev type is an alias for
|
|
.Sy draid1 .
|
|
.Pp
|
|
A dRAID with N disks of size X, D data disks per redundancy group, P parity
|
|
level, and S distributed hot spares can hold approximately (N-S)*(D/(D+P))*X
|
|
bytes and can withstand P device(s) failing without losing data.
|
|
.It Sy draid[<parity>][:<data>d][:<children>c][:<spares>s]
|
|
A non-default dRAID configuration can be specified by appending one or more
|
|
of the following optional arguments to the
|
|
.Sy draid
|
|
keyword.
|
|
.Pp
|
|
.Em parity
|
|
- The parity level (1-3).
|
|
.Pp
|
|
.Em data
|
|
- The number of data devices per redundancy group.
|
|
In general a smaller value of D will increase IOPS, improve the compression ratio, and speed up resilvering at the expense of total usable capacity.
|
|
Defaults to 8, unless N-P-S is less than 8.
|
|
.Pp
|
|
.Em children
|
|
- The expected number of children.
|
|
Useful as a cross-check when listing a large number of devices.
|
|
An error is returned when the provided number of children differs.
|
|
.Pp
|
|
.Em spares
|
|
- The number of distributed hot spares.
|
|
Defaults to zero.
|
|
.Pp
|
|
.Pp
|
|
.It Sy spare
|
|
A pseudo-vdev which keeps track of available hot spares for a pool.
|
|
For more information, see the
|
|
.Sx Hot Spares
|
|
section.
|
|
.It Sy log
|
|
A separate intent log device.
|
|
If more than one log device is specified, then writes are load-balanced between
|
|
devices.
|
|
Log devices can be mirrored.
|
|
However, raidz vdev types are not supported for the intent log.
|
|
For more information, see the
|
|
.Sx Intent Log
|
|
section.
|
|
.It Sy dedup
|
|
A device dedicated solely for deduplication tables.
|
|
The redundancy of this device should match the redundancy of the other normal
|
|
devices in the pool. If more than one dedup device is specified, then
|
|
allocations are load-balanced between those devices.
|
|
.It Sy special
|
|
A device dedicated solely for allocating various kinds of internal metadata,
|
|
and optionally small file blocks.
|
|
The redundancy of this device should match the redundancy of the other normal
|
|
devices in the pool. If more than one special device is specified, then
|
|
allocations are load-balanced between those devices.
|
|
.Pp
|
|
For more information on special allocations, see the
|
|
.Sx Special Allocation Class
|
|
section.
|
|
.It Sy cache
|
|
A device used to cache storage pool data.
|
|
A cache device cannot be configured as a mirror or raidz group.
|
|
For more information, see the
|
|
.Sx Cache Devices
|
|
section.
|
|
.El
|
|
.Pp
|
|
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
|
|
contain files or disks.
|
|
Mirrors of mirrors
|
|
.Pq or other combinations
|
|
are not allowed.
|
|
.Pp
|
|
A pool can have any number of virtual devices at the top of the configuration
|
|
.Po known as
|
|
.Qq root vdevs
|
|
.Pc .
|
|
Data is dynamically distributed across all top-level devices to balance data
|
|
among devices.
|
|
As new virtual devices are added, ZFS automatically places data on the newly
|
|
available devices.
|
|
.Pp
|
|
Virtual devices are specified one at a time on the command line, separated by
|
|
whitespace.
|
|
The keywords
|
|
.Sy mirror
|
|
and
|
|
.Sy raidz
|
|
are used to distinguish where a group ends and another begins.
|
|
For example, the following creates two root vdevs, each a mirror of two disks:
|
|
.Bd -literal
|
|
# zpool create mypool mirror sda sdb mirror sdc sdd
|
|
.Ed
|
|
.Ss Device Failure and Recovery
|
|
ZFS supports a rich set of mechanisms for handling device failure and data
|
|
corruption.
|
|
All metadata and data is checksummed, and ZFS automatically repairs bad data
|
|
from a good copy when corruption is detected.
|
|
.Pp
|
|
In order to take advantage of these features, a pool must make use of some form
|
|
of redundancy, using either mirrored or raidz groups.
|
|
While ZFS supports running in a non-redundant configuration, where each root
|
|
vdev is simply a disk or file, this is strongly discouraged.
|
|
A single case of bit corruption can render some or all of your data unavailable.
|
|
.Pp
|
|
A pool's health status is described by one of three states: online, degraded,
|
|
or faulted.
|
|
An online pool has all devices operating normally.
|
|
A degraded pool is one in which one or more devices have failed, but the data is
|
|
still available due to a redundant configuration.
|
|
A faulted pool has corrupted metadata, or one or more faulted devices, and
|
|
insufficient replicas to continue functioning.
|
|
.Pp
|
|
The health of the top-level vdev, such as mirror or raidz device, is
|
|
potentially impacted by the state of its associated vdevs, or component
|
|
devices.
|
|
A top-level vdev or component device is in one of the following states:
|
|
.Bl -tag -width "DEGRADED"
|
|
.It Sy DEGRADED
|
|
One or more top-level vdevs is in the degraded state because one or more
|
|
component devices are offline.
|
|
Sufficient replicas exist to continue functioning.
|
|
.Pp
|
|
One or more component devices is in the degraded or faulted state, but
|
|
sufficient replicas exist to continue functioning.
|
|
The underlying conditions are as follows:
|
|
.Bl -bullet
|
|
.It
|
|
The number of checksum errors exceeds acceptable levels and the device is
|
|
degraded as an indication that something may be wrong.
|
|
ZFS continues to use the device as necessary.
|
|
.It
|
|
The number of I/O errors exceeds acceptable levels.
|
|
The device could not be marked as faulted because there are insufficient
|
|
replicas to continue functioning.
|
|
.El
|
|
.It Sy FAULTED
|
|
One or more top-level vdevs is in the faulted state because one or more
|
|
component devices are offline.
|
|
Insufficient replicas exist to continue functioning.
|
|
.Pp
|
|
One or more component devices is in the faulted state, and insufficient
|
|
replicas exist to continue functioning.
|
|
The underlying conditions are as follows:
|
|
.Bl -bullet
|
|
.It
|
|
The device could be opened, but the contents did not match expected values.
|
|
.It
|
|
The number of I/O errors exceeds acceptable levels and the device is faulted to
|
|
prevent further use of the device.
|
|
.El
|
|
.It Sy OFFLINE
|
|
The device was explicitly taken offline by the
|
|
.Nm zpool Cm offline
|
|
command.
|
|
.It Sy ONLINE
|
|
The device is online and functioning.
|
|
.It Sy REMOVED
|
|
The device was physically removed while the system was running.
|
|
Device removal detection is hardware-dependent and may not be supported on all
|
|
platforms.
|
|
.It Sy UNAVAIL
|
|
The device could not be opened.
|
|
If a pool is imported when a device was unavailable, then the device will be
|
|
identified by a unique identifier instead of its path since the path was never
|
|
correct in the first place.
|
|
.El
|
|
.Pp
|
|
If a device is removed and later re-attached to the system, ZFS attempts
|
|
to put the device online automatically.
|
|
Device attach detection is hardware-dependent and might not be supported on all
|
|
platforms.
|
|
.Ss Hot Spares
|
|
ZFS allows devices to be associated with pools as
|
|
.Qq hot spares .
|
|
These devices are not actively used in the pool, but when an active device
|
|
fails, it is automatically replaced by a hot spare.
|
|
To create a pool with hot spares, specify a
|
|
.Sy spare
|
|
vdev with any number of devices.
|
|
For example,
|
|
.Bd -literal
|
|
# zpool create pool mirror sda sdb spare sdc sdd
|
|
.Ed
|
|
.Pp
|
|
Spares can be shared across multiple pools, and can be added with the
|
|
.Nm zpool Cm add
|
|
command and removed with the
|
|
.Nm zpool Cm remove
|
|
command.
|
|
Once a spare replacement is initiated, a new
|
|
.Sy spare
|
|
vdev is created within the configuration that will remain there until the
|
|
original device is replaced.
|
|
At this point, the hot spare becomes available again if another device fails.
|
|
.Pp
|
|
If a pool has a shared spare that is currently being used, the pool can not be
|
|
exported since other pools may use this shared spare, which may lead to
|
|
potential data corruption.
|
|
.Pp
|
|
Shared spares add some risk. If the pools are imported on different hosts, and
|
|
both pools suffer a device failure at the same time, both could attempt to use
|
|
the spare at the same time. This may not be detected, resulting in data
|
|
corruption.
|
|
.Pp
|
|
An in-progress spare replacement can be cancelled by detaching the hot spare.
|
|
If the original faulted device is detached, then the hot spare assumes its
|
|
place in the configuration, and is removed from the spare list of all active
|
|
pools.
|
|
.Pp
|
|
The
|
|
.Sy draid
|
|
vdev type provides distributed hot spares.
|
|
These hot spares are named after the dRAID vdev they're a part of (
|
|
.Qq draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID
|
|
) and may only be used by that dRAID vdev.
|
|
Otherwise, they behave the same as normal hot spares.
|
|
.Pp
|
|
Spares cannot replace log devices.
|
|
.Ss Intent Log
|
|
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
|
|
transactions.
|
|
For instance, databases often require their transactions to be on stable storage
|
|
devices when returning from a system call.
|
|
NFS and other applications can also use
|
|
.Xr fsync 2
|
|
to ensure data stability.
|
|
By default, the intent log is allocated from blocks within the main pool.
|
|
However, it might be possible to get better performance using separate intent
|
|
log devices such as NVRAM or a dedicated disk.
|
|
For example:
|
|
.Bd -literal
|
|
# zpool create pool sda sdb log sdc
|
|
.Ed
|
|
.Pp
|
|
Multiple log devices can also be specified, and they can be mirrored.
|
|
See the
|
|
.Sx EXAMPLES
|
|
section for an example of mirroring multiple log devices.
|
|
.Pp
|
|
Log devices can be added, replaced, attached, detached and removed. In
|
|
addition, log devices are imported and exported as part of the pool
|
|
that contains them.
|
|
Mirrored devices can be removed by specifying the top-level mirror vdev.
|
|
.Ss Cache Devices
|
|
Devices can be added to a storage pool as
|
|
.Qq cache devices .
|
|
These devices provide an additional layer of caching between main memory and
|
|
disk.
|
|
For read-heavy workloads, where the working set size is much larger than what
|
|
can be cached in main memory, using cache devices allow much more of this
|
|
working set to be served from low latency media.
|
|
Using cache devices provides the greatest performance improvement for random
|
|
read-workloads of mostly static content.
|
|
.Pp
|
|
To create a pool with cache devices, specify a
|
|
.Sy cache
|
|
vdev with any number of devices.
|
|
For example:
|
|
.Bd -literal
|
|
# zpool create pool sda sdb cache sdc sdd
|
|
.Ed
|
|
.Pp
|
|
Cache devices cannot be mirrored or part of a raidz configuration.
|
|
If a read error is encountered on a cache device, that read I/O is reissued to
|
|
the original storage pool device, which might be part of a mirrored or raidz
|
|
configuration.
|
|
.Pp
|
|
The content of the cache devices is persistent across reboots and restored
|
|
asynchronously when importing the pool in L2ARC (persistent L2ARC).
|
|
This can be disabled by setting
|
|
.Sy l2arc_rebuild_enabled = 0 .
|
|
For cache devices smaller than 1GB we do not write the metadata structures
|
|
required for rebuilding the L2ARC in order not to waste space. This can be
|
|
changed with
|
|
.Sy l2arc_rebuild_blocks_min_l2size .
|
|
The cache device header (512 bytes) is updated even if no metadata structures
|
|
are written. Setting
|
|
.Sy l2arc_headroom = 0
|
|
will result in scanning the full-length ARC lists for cacheable content to be
|
|
written in L2ARC (persistent ARC). If a cache device is added with
|
|
.Nm zpool Cm add
|
|
its label and header will be overwritten and its contents are not going to be
|
|
restored in L2ARC, even if the device was previously part of the pool. If a
|
|
cache device is onlined with
|
|
.Nm zpool Cm online
|
|
its contents will be restored in L2ARC. This is useful in case of memory pressure
|
|
where the contents of the cache device are not fully restored in L2ARC.
|
|
The user can off/online the cache device when there is less memory pressure
|
|
in order to fully restore its contents to L2ARC.
|
|
.Ss Pool checkpoint
|
|
Before starting critical procedures that include destructive actions (e.g
|
|
.Nm zfs Cm destroy
|
|
), an administrator can checkpoint the pool's state and in the case of a
|
|
mistake or failure, rewind the entire pool back to the checkpoint.
|
|
Otherwise, the checkpoint can be discarded when the procedure has completed
|
|
successfully.
|
|
.Pp
|
|
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
|
|
with care as it contains every part of the pool's state, from properties to vdev
|
|
configuration.
|
|
Thus, while a pool has a checkpoint certain operations are not allowed.
|
|
Specifically, vdev removal/attach/detach, mirror splitting, and
|
|
changing the pool's guid.
|
|
Adding a new vdev is supported but in the case of a rewind it will have to be
|
|
added again.
|
|
Finally, users of this feature should keep in mind that scrubs in a pool that
|
|
has a checkpoint do not repair checkpointed data.
|
|
.Pp
|
|
To create a checkpoint for a pool:
|
|
.Bd -literal
|
|
# zpool checkpoint pool
|
|
.Ed
|
|
.Pp
|
|
To later rewind to its checkpointed state, you need to first export it and
|
|
then rewind it during import:
|
|
.Bd -literal
|
|
# zpool export pool
|
|
# zpool import --rewind-to-checkpoint pool
|
|
.Ed
|
|
.Pp
|
|
To discard the checkpoint from a pool:
|
|
.Bd -literal
|
|
# zpool checkpoint -d pool
|
|
.Ed
|
|
.Pp
|
|
Dataset reservations (controlled by the
|
|
.Nm reservation
|
|
or
|
|
.Nm refreservation
|
|
zfs properties) may be unenforceable while a checkpoint exists, because the
|
|
checkpoint is allowed to consume the dataset's reservation.
|
|
Finally, data that is part of the checkpoint but has been freed in the
|
|
current state of the pool won't be scanned during a scrub.
|
|
.Ss Special Allocation Class
|
|
The allocations in the special class are dedicated to specific block types.
|
|
By default this includes all metadata, the indirect blocks of user data, and
|
|
any deduplication tables. The class can also be provisioned to accept
|
|
small file blocks.
|
|
.Pp
|
|
A pool must always have at least one normal (non-dedup/special) vdev before
|
|
other devices can be assigned to the special class. If the special class
|
|
becomes full, then allocations intended for it will spill back into the
|
|
normal class.
|
|
.Pp
|
|
Deduplication tables can be excluded from the special class by setting the
|
|
.Sy zfs_ddt_data_is_special
|
|
zfs module parameter to false (0).
|
|
.Pp
|
|
Inclusion of small file blocks in the special class is opt-in. Each dataset
|
|
can control the size of small file blocks allowed in the special class by
|
|
setting the
|
|
.Sy special_small_blocks
|
|
dataset property. It defaults to zero, so you must opt-in by setting it to a
|
|
non-zero value. See
|
|
.Xr zfs 8
|
|
for more info on setting this property.
|