Add TRIM support

UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated, the underlying device can
often manage itself more efficiently.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
it straightforward to add the platform-specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipeline;
one exception is that TRIM zios may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_trim tree.  The
ms_trim tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.
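
The aggregation described above can be pictured with a small model.
This is only a sketch under stated assumptions: the function names
are hypothetical, a sorted list of tuples stands in for the per-
metaslab ms_trim range tree, and none of it mirrors the kernel code.

```python
# Illustrative model only: add_range() and consume() are hypothetical
# names and do not correspond to functions in the ZFS source.

def add_range(tree, start, size):
    """Insert [start, start + size) into a sorted list of disjoint
    (start, end) tuples, merging with any touching or overlapping range."""
    end = start + size
    merged = []
    for s, e in tree:
        if e < start or s > end:
            merged.append((s, e))   # neither overlapping nor adjacent: keep
        else:
            start = min(start, s)   # overlapping or adjacent: absorb it
            end = max(end, e)
    merged.append((start, end))
    merged.sort()
    return merged


def consume(tree):
    """Return every pending range and leave the tree empty, as a periodic
    pass over one metaslab might do before issuing TRIMs."""
    pending = list(tree)
    tree.clear()
    return pending


# Three small frees that happen to be contiguous on disk.
trim_tree = []
for start, size in [(0, 4096), (8192, 4096), (4096, 4096)]:
    trim_tree = add_range(trim_tree, start, size)

extents = consume(trim_tree)
```

In this model the three 4K frees end up as a single 12K extent, which
is the effect the delay before consuming the tree is meant to allow.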

Since the automatic TRIM will skip ranges it considers too small,
there is value in occasionally running a full `zpool trim`.  Ranges
may be skipped when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.
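
How small ranges are skipped and large ones split can be sketched as
follows.  This is a hedged illustration, not the implementation:
plan_trims() and its arguments are hypothetical names, and the two
constants are simply the documented defaults of the
zfs_trim_extent_bytes_min and zfs_trim_extent_bytes_max module
parameters.

```python
# Documented defaults of zfs_trim_extent_bytes_min / _max.
EXTENT_BYTES_MIN = 32 * 1024           # 32,768
EXTENT_BYTES_MAX = 128 * 1024 * 1024   # 134,217,728


def plan_trims(ranges, bytes_min=EXTENT_BYTES_MIN, bytes_max=EXTENT_BYTES_MAX):
    """Hypothetical helper: turn free (offset, length) ranges into the
    (offset, length) TRIM commands that would be issued."""
    commands = []
    for offset, length in ranges:
        if length < bytes_min:
            continue  # too small on its own, so it is skipped
        # Split large ranges into chunks of at most bytes_max.  The final
        # chunk is issued even if it is smaller than bytes_min, because it
        # is part of a larger range that was broken up.
        while length > 0:
            chunk = min(length, bytes_max)
            commands.append((offset, chunk))
            offset += chunk
            length -= chunk
    return commands


small_only = plan_trims([(0, 4096)])
split = plan_trims([(0, EXTENT_BYTES_MAX + 4096)])
trim_everything = plan_trims([(0, 4096)], bytes_min=0)
```

With the defaults, a lone 4K range produces no command, while passing
a minimum of 0 trims it, which matches the documented way to TRIM all
unallocated space.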

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
Brian Behlendorf
2019-03-29 09:13:20 -07:00
committed by GitHub
parent f94b3cbf43
commit 1b939560be
91 changed files with 5593 additions and 439 deletions
+120 -1
@@ -14,7 +14,7 @@
.\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
.\" own identifying information:
.\" Portions Copyright [yyyy] [name of copyright owner]
.TH ZFS-MODULE-PARAMETERS 5 "Feb 8, 2019"
.TH ZFS-MODULE-PARAMETERS 5 "Feb 15, 2019"
.SH NAME
zfs\-module\-parameters \- ZFS module parameters
.SH DESCRIPTION
@@ -1532,6 +1532,30 @@ See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.RE
.sp
.ne 2
.na
\fBzfs_vdev_trim_max_active\fR (int)
.ad
.RS 12n
Maximum trim/discard I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
.sp
Default value: \fB2\fR.
.RE
.sp
.ne 2
.na
\fBzfs_vdev_trim_min_active\fR (int)
.ad
.RS 12n
Minimum trim/discard I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
.sp
Default value: \fB1\fR.
.RE
.sp
.ne 2
.na
@@ -1619,6 +1643,12 @@ _
_
512 ZFS_DEBUG_SET_ERROR
Enable SET_ERROR and dprintf entries in the debug log.
_
1024 ZFS_DEBUG_INDIRECT_REMAP
Verify split blocks created by device removal.
_
2048 ZFS_DEBUG_TRIM
Verify TRIM ranges are always within the allocatable range tree.
.TE
.sp
* Requires debug build.
@@ -2341,6 +2371,82 @@ value of 75% will create a maximum of one thread per cpu.
Default value: \fB75\fR%.
.RE
.sp
.ne 2
.na
\fBzfs_trim_extent_bytes_max\fR (unsigned int)
.ad
.RS 12n
Maximum size of a TRIM command. Ranges larger than this will be split into
chunks no larger than \fBzfs_trim_extent_bytes_max\fR bytes before being
issued to the device.
.sp
Default value: \fB134,217,728\fR.
.RE
.sp
.ne 2
.na
\fBzfs_trim_extent_bytes_min\fR (unsigned int)
.ad
.RS 12n
Minimum size of TRIM commands. TRIM ranges smaller than this will be skipped
unless they're part of a larger range which was broken into chunks. This is
done because it's common for these small TRIMs to negatively impact overall
performance. This value can be set to 0 to TRIM all unallocated space.
.sp
Default value: \fB32,768\fR.
.RE
.sp
.ne 2
.na
\fBzfs_trim_metaslab_skip\fR (unsigned int)
.ad
.RS 12n
Skip uninitialized metaslabs during the TRIM process. This option is useful
for pools constructed from large thinly-provisioned devices where TRIM
operations are slow. As a pool ages an increasing fraction of the pool's
metaslabs will be initialized, progressively degrading the usefulness of
this option. This setting is stored when starting a manual TRIM and will
persist for the duration of the requested TRIM.
.sp
Default value: \fB0\fR.
.RE
.sp
.ne 2
.na
\fBzfs_trim_queue_limit\fR (unsigned int)
.ad
.RS 12n
Maximum number of queued TRIMs outstanding per leaf vdev. The number of
concurrent TRIM commands issued to the device is controlled by the
\fBzfs_vdev_trim_min_active\fR and \fBzfs_vdev_trim_max_active\fR module
options.
.sp
Default value: \fB10\fR.
.RE
.sp
.ne 2
.na
\fBzfs_trim_txg_batch\fR (unsigned int)
.ad
.RS 12n
The number of transaction groups worth of frees which should be aggregated
before TRIM operations are issued to the device. This setting represents a
trade-off between issuing larger, more efficient TRIM operations and the
delay before the recently trimmed space is available for use by the device.
.sp
Increasing this value will allow frees to be aggregated for a longer time.
This will result in larger TRIM operations and potentially increased memory
usage. Decreasing this value will have the opposite effect. The default
value of 32 was determined to be a reasonable compromise.
.sp
Default value: \fB32\fR.
.RE
.sp
.ne 2
.na
@@ -2364,6 +2470,19 @@ Flush dirty data to disk at least every N seconds (maximum txg duration)
Default value: \fB5\fR.
.RE
.sp
.ne 2
.na
\fBzfs_vdev_aggregate_trim\fR (int)
.ad
.RS 12n
Allow TRIM I/Os to be aggregated. This is normally not helpful because
the extents to be trimmed will have already been aggregated by the
metaslab. This option is provided for debugging and performance analysis.
.sp
Default value: \fB0\fR.
.RE
.sp
.ne 2
.na
+83 -11
@@ -174,6 +174,13 @@
.Op Fl s | Fl p
.Ar pool Ns ...
.Nm
.Cm trim
.Op Fl d
.Op Fl r Ar rate
.Op Fl c | Fl s
.Ar pool
.Op Ar device Ns ...
.Nm
.Cm set
.Ar property Ns = Ns Ar value
.Ar pool
@@ -187,7 +194,7 @@
.Nm
.Cm status
.Oo Fl c Ar SCRIPT Oc
.Op Fl DigLpPsvx
.Op Fl DigLpPstvx
.Op Fl T Sy u Ns | Ns Sy d
.Oo Ar pool Oc Ns ...
.Op Ar interval Op Ar count
@@ -806,6 +813,28 @@ Any write requests that have yet to be committed to disk would be blocked.
.It Sy panic
Prints out a message to the console and generates a system crash dump.
.El
.It Sy autotrim Ns = Ns Sy on Ns | Ns Sy off
When set to
.Sy on
space which has been recently freed, and is no longer allocated by the pool,
will be periodically trimmed. This allows block device vdevs which support
BLKDISCARD, such as SSDs, or file vdevs on which the underlying file system
supports hole-punching, to reclaim unused blocks. The default setting for
this property is
.Sy off .
.Pp
Automatic TRIM does not immediately reclaim blocks after a free. Instead,
it will optimistically delay, allowing smaller ranges to be aggregated into
a few larger ones. These can then be issued more efficiently to the storage.
.Pp
Be aware that automatic trimming of recently freed data blocks can put
significant stress on the underlying storage devices. This will vary
depending of how well the specific device handles these commands. For
lower end devices it is often possible to achieve most of the benefits
of automatic trimming by running an on-demand (manual) TRIM periodically
using the
.Nm zpool Cm trim
command.
.It Sy feature@ Ns Ar feature_name Ns = Ns Sy enabled
The value of this property is the current state of
.Ar feature_name .
@@ -1782,15 +1811,10 @@ the path. This can be used in conjunction with the
.Fl L
flag.
.It Fl r
Print request size histograms for the leaf ZIOs. This includes
histograms of individual ZIOs (
.Ar ind )
and aggregate ZIOs (
.Ar agg ).
These stats can be useful for seeing how well the ZFS IO aggregator is
working. Do not confuse these request size stats with the block layer
requests; it's possible ZIOs can be broken up before being sent to the
block device.
Print request size histograms for the leaf vdev's IO. This includes
histograms of individual IOs (ind) and aggregate IOs (agg). These stats
can be useful for observing how well IO aggregation is working. Note
that TRIM IOs may exceed 16M, but will be counted as 16M.
.It Fl v
Verbose statistics. Reports usage statistics for individual vdevs within the
pool, in addition to the pool-wide statistics.
@@ -1829,6 +1853,8 @@ Average amount of time IO spent in asynchronous priority queues.
Does not include disk time.
.Ar scrub :
Average queuing time in scrub queue. Does not include disk time.
.Ar trim :
Average queuing time in trim queue. Does not include disk time.
.It Fl q
Include active queue statistics. Each priority queue has both
pending (
@@ -1846,6 +1872,8 @@ queues.
Current number of entries in asynchronous priority queues.
.Ar scrubq_read :
Current number of entries in scrub queue.
.Ar trimq_write :
Current number of entries in trim queue.
.Pp
All queue statistics are instantaneous measurements of the number of
entries in the queues. If you specify an interval, the measurements
@@ -2151,6 +2179,48 @@ restarted from the beginning. Any drives that were scheduled for a deferred
resilver will be added to the new one.
.It Xo
.Nm
.Cm trim
.Op Fl d
.Op Fl c | Fl s
.Ar pool
.Op Ar device Ns ...
.Xc
Initiates an immediate on-demand TRIM operation for all of the free space in
a pool. This operation informs the underlying storage devices of all blocks
in the pool which are no longer allocated and allows thinly provisioned
devices to reclaim the space.
.Pp
A manual on-demand TRIM operation can be initiated irrespective of the
.Sy autotrim
pool property setting. See the documentation for the
.Sy autotrim
property above for the types of vdevs which can be trimmed.
.Bl -tag -width Ds
.It Fl d -secure
Causes a secure TRIM to be initiated. When performing a secure TRIM, the
device guarantees that data stored on the trimmed blocks has been erased.
This requires support from the device and is not supported by all SSDs.
.It Fl r -rate Ar rate
Controls the rate at which the TRIM operation progresses. Without this
option TRIM is executed as quickly as possible. The rate, expressed in bytes
per second, is applied on a per-vdev basis and may be set differently for
each leaf vdev.
.It Fl c, -cancel
Cancel trimming on the specified devices, or all eligible devices if none
are specified.
If one or more target devices are invalid or are not currently being
trimmed, the command will fail and no cancellation will occur on any device.
.It Fl s -suspend
Suspend trimming on the specified devices, or all eligible devices if none
are specified.
If one or more target devices are invalid or are not currently being
trimmed, the command will fail and no suspension will occur on any device.
Trimming can then be resumed by running
.Nm zpool Cm trim
with no flags on the relevant target devices.
.El
.It Xo
.Nm
.Cm set
.Ar property Ns = Ns Ar value
.Ar pool
@@ -2238,7 +2308,7 @@ and automatically import it.
.Nm
.Cm status
.Op Fl c Op Ar SCRIPT1 Ns Oo , Ns Ar SCRIPT2 Oc Ns ...
.Op Fl DigLpPsvx
.Op Fl DigLpPstvx
.Op Fl T Sy u Ns | Ns Sy d
.Oo Ar pool Oc Ns ...
.Op Ar interval Op Ar count
@@ -2295,6 +2365,8 @@ didn't complete in \fBzio_slow_io_ms\fR milliseconds (default 30 seconds).
This does not necessarily mean the IOs failed to complete, just took an
unreasonably long amount of time. This may indicate a problem with the
underlying storage.
.It Fl t
Display vdev TRIM status.
.It Fl T Sy u Ns | Ns Sy d
Display a time stamp.
Specify