.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd April 7, 2023
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices,
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing, without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A distributed-parity layout, similar to RAID-5/6, with improved distribution of
parity, and which does not suffer from the RAID-5/6
.Qq write hole ,
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant number of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single, double, or triple parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
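.Pp
The argument syntax above can be illustrated with a short Python sketch.
It is not the parser used by zpool; the names DraidSpec and parse_draid_spec
are hypothetical, and accepting the suffixed arguments in any order is an
assumption made to keep the example small.
The data count is left unset when omitted, because its default depends on the
number of children.

```python
import re
from typing import NamedTuple, Optional


class DraidSpec(NamedTuple):
    """Hypothetical container for the fields of a draid keyword."""
    parity: int               # 1-3; plain "draid" is an alias for draid1
    data: Optional[int]       # None: not given, the default applies
    children: Optional[int]   # None: no cross-check requested
    spares: int               # defaults to zero


# draid[parity][:<data>d][:<children>c][:<spares>s]
_SPEC_RE = re.compile(r"^draid([1-3])?((?::\d+[dcs])*)$")


def parse_draid_spec(spec: str) -> DraidSpec:
    """Split a draid keyword into its parity and optional arguments."""
    match = _SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"not a draid specification: {spec!r}")
    parity = int(match.group(1) or 1)
    fields = {"d": None, "c": None, "s": 0}
    # Assumption: arguments may appear in any order; a repeated suffix
    # simply overrides the earlier value.
    for value, suffix in re.findall(r":(\d+)([dcs])", match.group(2)):
        fields[suffix] = int(value)
    return DraidSpec(parity, fields["d"], fields["c"], fields["s"])


def check_children(spec: DraidSpec, devices: list) -> None:
    """Mimic the cross-check: fail when the listed device count differs."""
    if spec.children is not None and spec.children != len(devices):
        raise ValueError(
            f"expected {spec.children} children, got {len(devices)}")
```

For instance, parse_draid_spec("draid2:4d:11c:1s") describes a double-parity
dRAID with 4 data devices per group, 11 expected children, and 1 distributed
spare.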
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely to deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely to allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
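.Pp
The capacity and IOPS approximations given for the mirror, raidz, and draid
vdev types can be worked through with a small Python sketch.
The function names are hypothetical and the results are the same rough
estimates as the formulas in the text, not what a real pool will report.
The minimum allocation helper is inferred from the single 8 x 4 KiB = 32 KiB
example and is an assumption.

```python
def mirror_capacity(n: int, x: int) -> int:
    """A mirror of N disks of size X holds X bytes."""
    if n < 2:
        raise ValueError("a mirror needs two or more devices")
    return x


def raidz_capacity(n: int, x: int, p: int) -> int:
    """A raidz group holds approximately (N-P)*X bytes."""
    if p not in (1, 2, 3):
        raise ValueError("parity must be 1, 2, or 3")
    if n < p + 1:
        raise ValueError("a raidz group needs at least P+1 devices")
    return (n - p) * x


def draid_capacity(n: int, x: int, d: int, p: int, s: int) -> int:
    """A dRAID holds approximately (N-S)*(D/(D+P))*X bytes (rounded down)."""
    return (n - s) * d * x // (d + p)


def draid_random_iops(n: int, d: int, p: int, s: int,
                      single_drive_iops: int) -> int:
    """floor((N-S)/(D+P)) * single_drive_IOPS."""
    return ((n - s) // (d + p)) * single_drive_iops


def draid_min_allocation(d: int, sector_bytes: int) -> int:
    """Assumed: one sector on each of the D data devices of a stripe."""
    return d * sector_bytes


if __name__ == "__main__":
    TIB = 2 ** 40
    # raidz2 built from six 4 TiB disks
    print(raidz_capacity(6, 4 * TIB, 2) // TIB, "TiB")
    # draid2 with 22 children, 8 data devices per group, 2 spares
    print(draid_capacity(22, 4 * TIB, 8, 2, 2) // TIB, "TiB")
    print(draid_random_iops(22, 8, 2, 2, 100), "IOPS")
```

The dRAID helpers use integer arithmetic so that large byte counts are not
rounded through floating point.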
.Pp
Virtual devices cannot be nested arbitrarily.
A mirror, raidz, or draid virtual device can only be created with files or disks.
Mirrors of mirrors or other such combinations are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where one group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
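.Pp
How keywords delimit groups can be sketched in Python.
This is an illustration under simplifying assumptions, not the logic zpool
implements: the function name group_vdevs is hypothetical, only the mirror,
raidz, and draid keywords are recognized, and class keywords such as log,
cache, and spare are not handled.

```python
def is_group_keyword(token: str) -> bool:
    """True for keywords that start a new redundancy group."""
    return token == "mirror" or token.startswith(("raidz", "draid"))


def group_vdevs(tokens):
    """Group whitespace-separated vdev arguments into root vdevs.

    Returns a list of (type, [devices]) tuples. Devices listed before
    any keyword are treated as individual single-disk vdevs.
    """
    vdevs = []
    current = None
    for token in tokens:
        if is_group_keyword(token):
            # A keyword ends the previous group and begins a new one.
            current = (token, [])
            vdevs.append(current)
        elif current is None:
            vdevs.append(("disk", [token]))
        else:
            current[1].append(token)
    return vdevs


if __name__ == "__main__":
    args = "mirror sda sdb mirror sdc sdd".split()
    for vdev_type, devices in group_vdevs(args):
        print(vdev_type, devices)
```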
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data are checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs are in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors or slow I/Os exceeds acceptable levels and the
device is degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs are in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices are in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported while a device is unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block cannot be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At that point, the hot spare becomes available again in case another device
fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they are a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single-parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
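.Pp
The naming scheme for distributed spares can be decoded with a short Python
sketch.
The names DraidSpareName and parse_draid_spare_name are hypothetical, and the
pattern is derived only from the draid1-2-3 example above.

```python
import re
from typing import NamedTuple


class DraidSpareName(NamedTuple):
    """Hypothetical decoded form of a distributed spare name."""
    parity: int  # parity level of the owning dRAID vdev
    vdev: int    # index of the dRAID vdev the spare belongs to
    spare: int   # index of the spare within that vdev


# draid<parity>-<vdev>-<spare>, following the draid1-2-3 example
_SPARE_RE = re.compile(r"draid([1-3])-(\d+)-(\d+)")


def parse_draid_spare_name(name: str) -> DraidSpareName:
    """Decode a distributed spare name into its three components."""
    match = _SPARE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a distributed spare name: {name!r}")
    parity, vdev, spare = (int(part) for part in match.groups())
    return DraidSpareName(parity, vdev, spare)
```

Decoding draid1-2-3 this way yields parity 1, vdev 2, and spare 3, matching
the description above.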
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low-latency media.
Using cache devices provides the greatest performance improvement for random
read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously in L2ARC when importing the pool (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful in case of memory pressure,
where the contents of the cache device are not fully restored in L2ARC.
The user can offline and then online the cache device when there is less memory
pressure, to fully restore its contents to L2ARC.
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, vdev removal, attach, and detach, mirror splitting, and
changing the pool's GUID are not allowed.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind the pool to its checkpointed state, first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool will not be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to a nonzero value.
See
.Xr zfsprops 7
for more info on this property.
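.Pp
The placement rules in this section can be summarized as a small Python
decision function.
This is a sketch of the rules as described here, not the ZFS allocator: the
names BlockKind and allocation_class are hypothetical, the separate dedup
class is ignored, and treating the special_small_blocks value as an inclusive
size threshold is an assumption.

```python
from enum import Enum


class BlockKind(Enum):
    METADATA = "metadata"
    INDIRECT = "indirect block of user data"
    DEDUP_TABLE = "deduplication table"
    FILE_DATA = "file data"


def allocation_class(kind: BlockKind, size: int, *,
                     special_small_blocks: int = 0,
                     ddt_data_is_special: bool = True,
                     special_full: bool = False) -> str:
    """Return "special" or "normal" for a block of the given kind and size."""
    if kind in (BlockKind.METADATA, BlockKind.INDIRECT):
        wants_special = True
    elif kind is BlockKind.DEDUP_TABLE:
        # Excluded when the zfs_ddt_data_is_special parameter is unset.
        wants_special = ddt_data_is_special
    else:
        # Small file blocks are opt-in: a zero property disables them.
        # Assumption: the property is an inclusive upper bound on size.
        wants_special = 0 < special_small_blocks and size <= special_small_blocks
    if wants_special and not special_full:
        return "special"
    # Either not eligible, or the special class is full and spills back.
    return "normal"
```

With the default property value of zero, file data always lands in the normal
class, while metadata lands in the special class until that class is full.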