mirror of
https://git.proxmox.com/git/mirror_zfs.git
synced 2025-01-07 16:50:26 +03:00
214196e9f5
Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11786
4218 lines
103 KiB
Groff
'\" te
|
|
.\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
|
|
.\" Copyright (c) 2019, 2020 by Delphix. All rights reserved.
|
|
.\" Copyright (c) 2019 Datto Inc.
|
|
.\" The contents of this file are subject to the terms of the Common Development
|
|
.\" and Distribution License (the "License"). You may not use this file except
|
|
.\" in compliance with the License. You can obtain a copy of the license at
|
|
.\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing.
|
|
.\"
|
|
.\" See the License for the specific language governing permissions and
|
|
.\" limitations under the License. When distributing Covered Code, include this
|
|
.\" CDDL HEADER in each file and include the License file at
|
|
.\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
|
|
.\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
|
|
.\" own identifying information:
|
|
.\" Portions Copyright [yyyy] [name of copyright owner]
|
|
.TH ZFS-MODULE-PARAMETERS 5 "Mar 31, 2021" OpenZFS
|
|
.SH NAME
|
|
zfs\-module\-parameters \- ZFS module parameters
|
|
.SH DESCRIPTION
|
|
.sp
|
|
.LP
|
|
Description of the different parameters to the ZFS module.
|
|
|
|
.SS "Module parameters"
|
|
.sp
|
|
.LP
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_cache_max_bytes\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum size in bytes of the dbuf cache. The target size is the smaller of
this value and \fB1/2^dbuf_cache_shift\fR (1/32) of the target ARC size. The
behavior of the dbuf cache and its associated settings can be observed via the
\fB/proc/spl/kstat/zfs/dbufstats\fR kstat.
|
|
.sp
|
|
Default value: \fBULONG_MAX\fR.
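.sp
As a rough illustration of how the cap and the shift interact, the sketch
below computes a dbuf cache target size. The function name and the 64-bit
value used for ULONG_MAX are assumptions made for this example only.
.sp
.nf
```python
ULONG_MAX = 2**64 - 1  # assumption: 64-bit unsigned long


def dbuf_cache_target_bytes(arc_target_bytes, dbuf_cache_max_bytes=ULONG_MAX,
                            dbuf_cache_shift=5):
    """Smaller of the configured cap and arc_target / 2^dbuf_cache_shift."""
    return min(dbuf_cache_max_bytes, arc_target_bytes >> dbuf_cache_shift)


four_gib = 4 * 1024**3
# With the default cap, the 1/32 fraction of a 4 GiB ARC target wins.
print(dbuf_cache_target_bytes(four_gib))  # 134217728 (128 MiB)
```
.fi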
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_metadata_cache_max_bytes\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum size in bytes of the metadata dbuf cache. The target size is the
smaller of this value and \fB1/2^dbuf_metadata_cache_shift\fR (1/64) of the
target ARC size. The behavior of the metadata dbuf cache and its associated
settings can be observed via the \fB/proc/spl/kstat/zfs/dbufstats\fR kstat.
|
|
.sp
|
|
Default value: \fBULONG_MAX\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_cache_hiwater_pct\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
The percentage over \fBdbuf_cache_max_bytes\fR when dbufs must be evicted
|
|
directly.
|
|
.sp
|
|
Default value: \fB10\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_cache_lowater_pct\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
The percentage below \fBdbuf_cache_max_bytes\fR when the evict thread stops
|
|
evicting dbufs.
|
|
.sp
|
|
Default value: \fB10\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_cache_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set the size of the dbuf cache, \fBdbuf_cache_max_bytes\fR, to a log2 fraction
|
|
of the target ARC size.
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdbuf_metadata_cache_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set the size of the dbuf metadata cache, \fBdbuf_metadata_cache_max_bytes\fR,
|
|
to a log2 fraction of the target ARC size.
|
|
.sp
|
|
Default value: \fB6\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdmu_object_alloc_chunk_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
dnode slots allocated in a single operation as a power of 2. The default value
|
|
minimizes lock contention for the bulk operation performed.
|
|
.sp
|
|
Default value: \fB7\fR (128).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBdmu_prefetch_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Limit the amount we can prefetch with one call to this amount (in bytes).
|
|
This helps to limit the amount of memory that can be used by prefetching.
|
|
.sp
|
|
Default value: \fB134,217,728\fR (128MB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBignore_hole_birth\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This is an alias for \fBsend_holes_without_birth_time\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_feed_again\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Turbo L2ARC warm-up. When the L2ARC is cold the fill interval will be set as
|
|
fast as possible.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR to disable.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_feed_min_ms\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Min feed interval in milliseconds. Requires \fBl2arc_feed_again=1\fR and only
|
|
applicable in related situations.
|
|
.sp
|
|
Default value: \fB200\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_feed_secs\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Seconds between L2ARC writes.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_headroom\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
How far through the ARC lists to search for L2ARC cacheable content, expressed
|
|
as a multiplier of \fBl2arc_write_max\fR.
|
|
ARC persistence across reboots can be achieved with persistent L2ARC by setting
|
|
this parameter to \fB0\fR allowing the full length of ARC lists to be searched
|
|
for cacheable content.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_headroom_boost\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Scales \fBl2arc_headroom\fR by this percentage when L2ARC contents are being
|
|
successfully compressed before writing. A value of \fB100\fR disables this
|
|
feature.
|
|
.sp
|
|
Default value: \fB200\fR%.
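.sp
The relationship between \fBl2arc_write_max\fR, \fBl2arc_headroom\fR and
\fBl2arc_headroom_boost\fR can be sketched as follows. This is an illustrative
reading of the descriptions above, not the module's code. The function name
and the use of None for "search the whole list" are assumptions.
.sp
.nf
```python
def l2arc_scan_headroom(write_max, headroom=2, headroom_boost=200,
                        compressing=False):
    """Bytes of ARC list to scan for L2ARC-cacheable buffers (sketch)."""
    if headroom == 0:
        return None  # assumption: None stands for the full list length
    scan = write_max * headroom
    if compressing:
        # The boost is a percentage, so 100 leaves the headroom unchanged.
        scan = scan * headroom_boost // 100
    return scan


print(l2arc_scan_headroom(8388608))                    # 16777216
print(l2arc_scan_headroom(8388608, compressing=True))  # 33554432
```
.fi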
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_mfuonly\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Controls whether only MFU metadata and data are cached from ARC into L2ARC.
|
|
This may be desired to avoid wasting space on L2ARC when reading/writing large
|
|
amounts of data that are not expected to be accessed more than once. The
|
|
default is \fB0\fR, meaning both MRU and MFU data and metadata are cached.
|
|
When turning off (\fB0\fR) this feature some MRU buffers will still be present
|
|
in ARC and eventually cached on L2ARC.
|
|
.sp
|
|
Use \fB0\fR for no (default) and \fB1\fR for yes.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_meta_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Percent of ARC size allowed for L2ARC-only headers.
|
|
Since L2ARC buffers are not evicted on memory pressure, too many headers on a
system with an irrationally large L2ARC can render it slow or unusable.
This parameter limits L2ARC writes and rebuilds to keep headers within bounds.
|
|
.sp
|
|
Default value: \fB33\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_trim_ahead\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Trims ahead of the current write size (\fBl2arc_write_max\fR) on L2ARC devices
|
|
by this percentage of write size if we have filled the device. If set to
|
|
\fB100\fR we TRIM twice the space required to accommodate upcoming writes. A
|
|
minimum of 64MB will be trimmed. It also enables TRIM of the whole L2ARC device
|
|
upon creation or addition to an existing pool or if the header of the device is
|
|
invalid upon importing a pool or onlining a cache device. A value of \fB0\fR
|
|
disables TRIM on L2ARC altogether and is the default as it can put significant
|
|
stress on the underlying storage devices. This will vary depending on how well
|
|
the specific device handles these commands.
|
|
.sp
|
|
Default value: \fB0\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_noprefetch\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Do not write buffers to L2ARC if they were prefetched but not used by
|
|
applications.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR to disable.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_norw\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
No reads during writes.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_write_boost\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Cold L2ARC devices will have \fBl2arc_write_max\fR increased by this amount
|
|
while they remain cold.
|
|
.sp
|
|
Default value: \fB8,388,608\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_write_max\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Max write bytes per interval.
|
|
.sp
|
|
Default value: \fB8,388,608\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_rebuild_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Rebuild the L2ARC when importing a pool (persistent L2ARC). This can be
|
|
disabled if there are problems importing a pool or attaching an L2ARC device
|
|
(e.g. the L2ARC device is slow in reading stored log metadata, or the metadata
|
|
has become somehow fragmented/unusable).
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBl2arc_rebuild_blocks_min_l2size\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Min size (in bytes) of an L2ARC device required in order to write log blocks
|
|
in it. The log blocks are used upon importing the pool to rebuild
|
|
the L2ARC (persistent L2ARC). Rationale: for L2ARC devices less than 1GB, the
|
|
amount of data l2arc_evict() evicts is significant compared to the amount of
|
|
restored L2ARC data. In this case do not write log blocks in L2ARC in order not
|
|
to waste space.
|
|
.sp
|
|
Default value: \fB1,073,741,824\fR (1GB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_aliquot\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Metaslab granularity, in bytes. This is roughly similar to what would be
|
|
referred to as the "stripe size" in traditional RAID arrays. In normal
|
|
operation, ZFS will try to write this amount of data to a top-level vdev
|
|
before moving on to the next one.
|
|
.sp
|
|
Default value: \fB524,288\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_bias_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable metaslab group biasing based on its vdev's over- or under-utilization
|
|
relative to the pool.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_force_ganging\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Make some blocks above a certain size be gang blocks. This option is used
|
|
by the test suite to facilitate testing.
|
|
.sp
|
|
Default value: \fB16,777,217\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_history_output_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When attempting to log the output nvlist of an ioctl in the on-disk history, the
|
|
output will not be stored if it is larger than this size (in bytes). This must be
less than DMU_MAX_ACCESS (64MB). This applies primarily to
|
|
zfs_ioc_channel_program().
|
|
.sp
|
|
Default value: \fB1MB\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_keep_log_spacemaps_at_export\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Prevent log spacemaps from being destroyed during pool exports and destroys.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_metaslab_segment_weight_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable/disable segment-based metaslab selection.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_metaslab_switch_threshold\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When using segment-based metaslab selection, continue allocating
|
|
from the active metaslab until \fBzfs_metaslab_switch_threshold\fR
|
|
worth of buckets have been exhausted.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_debug_load\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Load all metaslabs during pool import.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_debug_unload\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Prevent metaslabs from being unloaded.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_fragmentation_factor_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable use of the fragmentation metric in computing metaslab weights.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_df_max_search\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum distance to search forward from the last offset. Without this limit,
|
|
fragmented pools can see >100,000 iterations and metaslab_block_picker()
|
|
becomes the performance limiting factor on high-performance storage.
|
|
|
|
With the default setting of 16MB, we typically see less than 500 iterations,
|
|
even with very fragmented, ashift=9 pools. The maximum number of iterations
|
|
possible is: \fBmetaslab_df_max_search / (2 * (1<<ashift))\fR.
|
|
With the default setting of 16MB this is 16*1024 (with ashift=9) or 2048
|
|
(with ashift=12).
|
|
.sp
|
|
Default value: \fB16,777,216\fR (16MB)
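.sp
The iteration bound quoted above can be checked with a few lines of
arithmetic. The function name is a hypothetical helper for this example.
.sp
.nf
```python
def max_df_search_iterations(max_search_bytes, ashift):
    """Upper bound: metaslab_df_max_search / (2 * (1 << ashift))."""
    return max_search_bytes // (2 * (1 << ashift))


default_max_search = 16 * 1024 * 1024  # 16MB default
print(max_df_search_iterations(default_max_search, 9))   # 16384 (16*1024)
print(max_df_search_iterations(default_max_search, 12))  # 2048
```
.fi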
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_df_use_largest_segment\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If we are not searching forward (due to metaslab_df_max_search,
|
|
metaslab_df_free_pct, or metaslab_df_alloc_threshold), this tunable controls
|
|
what segment is used. If it is set, we will use the largest free segment.
|
|
If it is not set, we will use a segment of exactly the requested size (or
|
|
larger).
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_metaslab_max_size_cache_sec\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
When we unload a metaslab, we cache the size of the largest free chunk. We use
|
|
that cached size to determine whether or not to load a metaslab for a given
|
|
allocation. As more frees accumulate in that metaslab while it's unloaded, the
|
|
cached max size becomes less and less accurate. After a number of seconds
|
|
controlled by this tunable, we stop considering the cached max size and start
|
|
considering only the histogram instead.
|
|
.sp
|
|
Default value: \fB3600 seconds\fR (one hour)
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_metaslab_mem_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When we are loading a new metaslab, we check the amount of memory being used
|
|
to store metaslab range trees. If it is over a threshold, we attempt to unload
|
|
the least recently used metaslab to prevent the system from clogging all of
|
|
its memory with range trees. This tunable sets the percentage of total system
|
|
memory that is the threshold.
|
|
.sp
|
|
Default value: \fB25 percent\fR
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_default_ms_count\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When a vdev is added, target this number of metaslabs per top-level vdev.
|
|
.sp
|
|
Default value: \fB200\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_default_ms_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Default limit for metaslab size.
|
|
.sp
|
|
Default value: \fB29\fR [meaning (1 << 29) = 512MB].
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_max_auto_ashift\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum ashift used when optimizing for logical -> physical sector size on new
|
|
top-level vdevs.
|
|
.sp
|
|
Default value: \fBASHIFT_MAX\fR (16).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_min_auto_ashift\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Minimum ashift used when creating new top-level vdevs.
|
|
.sp
|
|
Default value: \fBASHIFT_MIN\fR (9).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_min_ms_count\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum number of metaslabs to create in a top-level vdev.
|
|
.sp
|
|
Default value: \fB16\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBvdev_validate_skip\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Skip label validation steps during pool import. Changing is not recommended
|
|
unless you know what you are doing and are recovering a damaged label.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_ms_count_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Practical upper limit of total metaslabs per top-level vdev.
|
|
.sp
|
|
Default value: \fB131,072\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_preload_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable metaslab group preloading.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_lba_weighting_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Give more weight to metaslabs with lower LBAs, assuming they have
|
|
greater bandwidth as is typically the case on a modern constant
|
|
angular velocity disk drive.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_unload_delay\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
After a metaslab is used, we keep it loaded for this many txgs, to attempt to
|
|
reduce unnecessary reloading. Note that both this many txgs and
|
|
\fBmetaslab_unload_delay_ms\fR milliseconds must pass before unloading will
|
|
occur.
|
|
.sp
|
|
Default value: \fB32\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBmetaslab_unload_delay_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
After a metaslab is used, we keep it loaded for this many milliseconds, to
|
|
attempt to reduce unnecessary reloading. Note that both this many
|
|
milliseconds and \fBmetaslab_unload_delay\fR txgs must pass before unloading
|
|
will occur.
|
|
.sp
|
|
Default value: \fB600000\fR (ten minutes).
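.sp
Because both delays must elapse, the unload decision is a logical AND of the
two conditions. The sketch below illustrates this. The function name and the
use of ">=" at the boundary are assumptions, not taken from the module.
.sp
.nf
```python
def may_unload_metaslab(txgs_since_use, ms_since_use,
                        unload_delay=32, unload_delay_ms=600000):
    """True only when both the txg delay and the time delay have passed."""
    # Assumption: the boundary is inclusive; the real check may differ.
    return (txgs_since_use >= unload_delay
            and ms_since_use >= unload_delay_ms)


print(may_unload_metaslab(40, 100000))  # False: not enough time has passed
print(may_unload_metaslab(40, 700000))  # True: both delays satisfied
```
.fi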
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBsend_holes_without_birth_time\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When set, the hole_birth optimization will not be used, and all holes will
|
|
always be sent on zfs send. This is useful if you suspect your datasets are
|
|
affected by a bug in hole_birth.
|
|
.sp
|
|
Use \fB1\fR for on (default) and \fB0\fR for off.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_config_path\fR (charp)
|
|
.ad
|
|
.RS 12n
|
|
SPA config file
|
|
.sp
|
|
Default value: \fB/etc/zfs/zpool.cache\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_asize_inflation\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Multiplication factor used to estimate actual disk consumption from the
|
|
size of data being written. The default value is a worst case estimate,
|
|
but lower values may be valid for a given pool depending on its
|
|
configuration. Pool administrators who understand the factors involved
|
|
may wish to specify a more realistic inflation factor, particularly if
|
|
they operate close to quota or capacity limits.
|
|
.sp
|
|
Default value: \fB24\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_load_print_vdev_tree\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Whether to print the vdev tree in the debugging message buffer during pool import.
|
|
Use 0 to disable and 1 to enable.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_load_verify_data\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Whether to traverse data blocks during an "extreme rewind" (\fB-X\fR)
|
|
import. Use 0 to disable and 1 to enable.
|
|
|
|
An extreme rewind import normally performs a full traversal of all
|
|
blocks in the pool for verification. If this parameter is set to 0,
|
|
the traversal skips non-metadata blocks. It can be toggled once the
|
|
import has started to stop or start the traversal of non-metadata blocks.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_load_verify_metadata\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Whether to traverse blocks during an "extreme rewind" (\fB-X\fR)
|
|
pool import. Use 0 to disable and 1 to enable.
|
|
|
|
An extreme rewind import normally performs a full traversal of all
|
|
blocks in the pool for verification. If this parameter is set to 0,
|
|
the traversal is not performed. It can be toggled once the import has
|
|
started to stop or start the traversal.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_load_verify_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Sets the maximum number of bytes to consume during pool import to the log2
|
|
fraction of the target ARC size.
|
|
.sp
|
|
Default value: \fB4\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBspa_slop_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Normally, we don't allow the last 3.125% (1/(2^spa_slop_shift)) of space
|
|
in the pool to be consumed. This ensures that we don't run the pool
|
|
completely out of space, due to unaccounted changes (e.g. to the MOS).
|
|
It also limits the worst-case time to allocate space. If we have
|
|
less than this amount of free space, most ZPL operations (e.g. write,
|
|
create) will return ENOSPC.
|
|
.sp
|
|
Default value: \fB5\fR.
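.sp
As a minimal sketch, the reserved fraction can be computed with a right
shift. The function name is hypothetical, and any minimum or maximum limits
that an implementation may apply to the reserve are deliberately ignored.
.sp
.nf
```python
def slop_space_bytes(pool_bytes, spa_slop_shift=5):
    """Space held back from normal use: pool size / 2^spa_slop_shift."""
    # Assumption: no additional clamping of the result is modeled here.
    return pool_bytes >> spa_slop_shift


one_tib = 1024**4
print(slop_space_bytes(one_tib))     # 34359738368 (32 GiB, 1/32 of the pool)
print(slop_space_bytes(one_tib, 6))  # 17179869184 (16 GiB, 1/64 of the pool)
```
.fi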
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBvdev_removal_max_span\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
During top-level vdev removal, chunks of data are copied from the vdev
|
|
which may include free space in order to trade bandwidth for IOPS.
|
|
This parameter determines the maximum span of free space (in bytes)
|
|
which will be included as "unnecessary" data in a chunk of copied data.
|
|
|
|
The default value here was chosen to align with
|
|
\fBzfs_vdev_read_gap_limit\fR, which is a similar concept when doing
|
|
regular reads (but there's no reason it has to be the same).
|
|
.sp
|
|
Default value: \fB32,768\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBvdev_file_logical_ashift\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Logical ashift for file-based devices.
|
|
.sp
|
|
Default value: \fB9\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBvdev_file_physical_ashift\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Physical ashift for file-based devices.
|
|
.sp
|
|
Default value: \fB9\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzap_iterate_prefetch\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If this is set, when we start iterating over a ZAP object, zfs will prefetch
|
|
the entire object (all leaf blocks). However, this is limited by
|
|
\fBdmu_prefetch_max\fR.
|
|
.sp
|
|
Use \fB1\fR for on (default) and \fB0\fR for off.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfetch_array_rd_sz\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
If prefetching is enabled, disable prefetching for reads larger than this size.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfetch_max_distance\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Max bytes to prefetch per stream.
|
|
.sp
|
|
Default value: \fB8,388,608\fR (8MB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfetch_max_idistance\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Max bytes to prefetch indirects for per stream.
|
|
.sp
|
|
Default value: \fB67,108,864\fR (64MB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfetch_max_streams\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Max number of streams per zfetch (prefetch streams per file).
|
|
.sp
|
|
Default value: \fB8\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfetch_min_sec_reap\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Min time in seconds before an active prefetch stream can be reclaimed.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_abd_scatter_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enables the ARC to use scatter/gather lists. When disabled, all allocations
are forced to be linear in kernel memory. Disabling can improve performance in
some code paths at the expense of fragmented kernel memory.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_abd_scatter_max_order\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of consecutive memory pages allocated in a single block for
|
|
scatter/gather lists. Default value is specified by the kernel itself.
|
|
.sp
|
|
Default value: \fB10\fR at the time of this writing.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_abd_scatter_min_size\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
This is the minimum allocation size that will use scatter (page-based)
|
|
ABD's. Smaller allocations will use linear ABD's.
|
|
.sp
|
|
Default value: \fB1536\fR (512B and 1KB allocations will be linear).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_dnode_limit\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
When the number of bytes consumed by dnodes in the ARC exceeds this number of
bytes, try to unpin some of it in response to demand for non-metadata. This
value acts as a ceiling to the amount of dnode metadata, and defaults to 0,
which indicates that the limit is instead the percentage of the ARC meta
buffers given by \fBzfs_arc_dnode_limit_percent\fR.
|
|
|
|
See also \fBzfs_arc_meta_prune\fR which serves a similar purpose but is used
|
|
when the amount of metadata in the ARC exceeds \fBzfs_arc_meta_limit\fR rather
|
|
than in response to overall demand for non-metadata.
|
|
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_dnode_limit_percent\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Percentage that can be consumed by dnodes of ARC meta buffers.
|
|
.sp
|
|
See also \fBzfs_arc_dnode_limit\fR which serves a similar purpose but has a
|
|
higher priority if set to nonzero value.
|
|
.sp
|
|
Default value: \fB10\fR%.
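.sp
The precedence between \fBzfs_arc_dnode_limit\fR and
\fBzfs_arc_dnode_limit_percent\fR can be sketched as below. The function name
and the arc_meta_limit argument are illustrative assumptions.
.sp
.nf
```python
def effective_dnode_limit(arc_meta_limit, dnode_limit=0,
                          dnode_limit_percent=10):
    """Byte limit for dnodes: explicit value wins, else a percentage."""
    if dnode_limit != 0:
        return dnode_limit  # an explicit nonzero limit has priority
    return arc_meta_limit * dnode_limit_percent // 100


print(effective_dnode_limit(1000))                   # 100
print(effective_dnode_limit(1000, dnode_limit=250))  # 250
```
.fi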
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_dnode_reduce_percent\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Percentage of ARC dnodes to try to scan in response to demand for non-metadata
|
|
when the number of bytes consumed by dnodes exceeds \fBzfs_arc_dnode_limit\fR.
|
|
|
|
.sp
|
|
Default value: \fB10\fR% of the number of dnodes in the ARC.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_average_blocksize\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The ARC's buffer hash table is sized based on the assumption of an average
|
|
block size of \fBzfs_arc_average_blocksize\fR (default 8K). This works out
|
|
to roughly 1MB of hash table per 1GB of physical memory with 8-byte pointers.
|
|
For configurations with a known larger average block size this value can be
|
|
increased to reduce the memory footprint.
|
|
|
|
.sp
|
|
Default value: \fB8192\fR.
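.sp
The "1MB per 1GB" estimate follows from one pointer per average-sized block.
A minimal sketch, assuming 8-byte pointers and ignoring any rounding the
implementation may apply (the function name is hypothetical):
.sp
.nf
```python
def arc_hash_table_bytes(physical_mem_bytes, average_blocksize=8192,
                         pointer_bytes=8):
    """Approximate hash table size: one pointer per expected block."""
    return physical_mem_bytes // average_blocksize * pointer_bytes


one_gib = 1024**3
print(arc_hash_table_bytes(one_gib))          # 1048576 (1 MiB)
print(arc_hash_table_bytes(one_gib, 131072))  # 65536 (64 KiB)
```
.fi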
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_eviction_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When \fBarc_is_overflowing()\fR, \fBarc_get_data_impl()\fR waits for this
|
|
percent of the requested amount of data to be evicted. For example, by
|
|
default for every 2KB that's evicted, 1KB of it may be "reused" by a new
|
|
allocation. Since this is above 100%, it ensures that progress is made
|
|
towards getting \fBarc_size\fR under \fBarc_c\fR. Since this is finite, it
|
|
ensures that allocations can still happen, even during the potentially long
|
|
time that \fBarc_size\fR is more than \fBarc_c\fR.
|
|
.sp
|
|
Default value: \fB200\fR.
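.sp
The 2KB-for-1KB example above is simply the percentage applied to the
requested size. A small sketch, with a hypothetical function name:
.sp
.nf
```python
def bytes_to_evict_before_alloc(requested_bytes, eviction_pct=200):
    """Amount that must be evicted before the allocation proceeds."""
    return requested_bytes * eviction_pct // 100


print(bytes_to_evict_before_alloc(1024))  # 2048
```
.fi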
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_evict_batch_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Number of ARC headers to evict per sub-list before proceeding to another sub-list.
|
|
This batch-style operation prevents entire sub-lists from being evicted at once
|
|
but comes at a cost of additional unlocking and locking.
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_grow_retry\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If set to a nonzero value, it will replace the arc_grow_retry value with this value.
|
|
The arc_grow_retry value (default 5) is the number of seconds the ARC will wait before
|
|
trying to resume growth after a memory pressure event.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_lotsfree_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Throttle I/O when free system memory drops below this percentage of total
|
|
system memory. Setting this value to 0 will disable the throttle.
|
|
.sp
|
|
Default value: \fB10\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_max\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Max size of ARC in bytes. If set to 0 then the max size of ARC is determined
|
|
by the amount of system memory installed. For Linux, 1/2 of system memory will
|
|
be used as the limit. For FreeBSD, the larger of all system memory - 1GB or
|
|
5/8 of system memory will be used as the limit. This value must be at least
|
|
67108864 (64 megabytes).
|
|
.sp
|
|
This value can be changed dynamically with some caveats. It cannot be set back
|
|
to 0 while running and reducing it below the current ARC size will not cause
|
|
the ARC to shrink without memory pressure to induce shrinking.
|
|
.sp
|
|
Default value: \fB0\fR.
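.sp
The platform defaults described above, used when \fBzfs_arc_max\fR is 0, can
be sketched as follows. The function name and the platform strings are
assumptions for illustration, and the 64MB minimum is not modeled.
.sp
.nf
```python
GIB = 1024**3


def default_arc_max(total_mem_bytes, platform):
    """Default ARC size limit when zfs_arc_max is left at 0 (sketch)."""
    if platform == "linux":
        return total_mem_bytes // 2
    if platform == "freebsd":
        # Larger of (all memory minus 1GB) and 5/8 of memory.
        return max(total_mem_bytes - GIB, total_mem_bytes * 5 // 8)
    raise ValueError("unknown platform: " + platform)


print(default_arc_max(16 * GIB, "linux"))    # 8589934592 (8 GiB)
print(default_arc_max(16 * GIB, "freebsd"))  # 16106127360 (15 GiB)
print(default_arc_max(2 * GIB, "freebsd"))   # 1342177280 (1.25 GiB)
```
.fi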
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_adjust_restarts\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The number of restart passes to make while scanning the ARC attempting
to free buffers in order to stay below the \fBzfs_arc_meta_limit\fR.
|
|
This value should not need to be tuned but is available to facilitate
|
|
performance analysis.
|
|
.sp
|
|
Default value: \fB4096\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_limit\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The maximum allowed size in bytes that meta data buffers are allowed to
|
|
consume in the ARC. When this limit is reached meta data buffers will
|
|
be reclaimed even if the overall arc_c_max has not been reached. This
|
|
value defaults to 0 which indicates that a percent which is based on
|
|
\fBzfs_arc_meta_limit_percent\fR of the ARC may be used for meta data.
|
|
.sp
|
|
This value may be changed dynamically, except that it cannot be set back to 0
to restore a percentage-based limit; it must be set to an explicit value.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_limit_percent\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Percentage of ARC buffers that can be used for meta data.
|
|
|
|
See also \fBzfs_arc_meta_limit\fR which serves a similar purpose but has a
|
|
higher priority if set to nonzero value.
|
|
|
|
.sp
|
|
Default value: \fB75\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_min\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The minimum allowed size in bytes that meta data buffers may consume in
|
|
the ARC. This value defaults to 0 which disables a floor on the amount
|
|
of the ARC devoted to meta data.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_prune\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The number of dentries and inodes to be scanned looking for entries
|
|
which can be dropped. This may be required when the ARC reaches the
|
|
\fBzfs_arc_meta_limit\fR because dentries and inodes can pin buffers
|
|
in the ARC. Increasing this value will cause the dentry and inode caches
|
|
to be pruned more aggressively. Setting this value to 0 will disable
|
|
pruning the inode and dentry caches.
|
|
.sp
|
|
Default value: \fB10,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_meta_strategy\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Define the strategy for ARC meta data buffer eviction (meta reclaim strategy).
|
|
A value of 0 (META_ONLY) will evict only the ARC meta data buffers.
|
|
A value of 1 (BALANCED) indicates that additional data buffers may be evicted if
|
|
that is required in order to evict the required number of meta data buffers.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_min\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Min size of ARC in bytes. If set to 0 then arc_c_min will default to
|
|
consuming the larger of 32M or 1/32 of total system memory.
|
|
.sp
|
|
Default value: \fB0\fR.
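.sp
A minimal sketch of the default described above, taking 32M to mean 32 MiB.
The function name is hypothetical.
.sp
.nf
```python
MIB = 1024**2


def default_arc_min(total_mem_bytes):
    """Default arc_c_min: larger of 32M and 1/32 of system memory."""
    return max(32 * MIB, total_mem_bytes // 32)


print(default_arc_min(16 * 1024**3))  # 536870912 (512 MiB)
print(default_arc_min(512 * MIB))     # 33554432 (the 32M floor applies)
```
.fi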
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_min_prefetch_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum time prefetched blocks are locked in the ARC, specified in ms.
|
|
A value of \fB0\fR will default to 1000 ms.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_min_prescient_prefetch_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum time "prescient prefetched" blocks are locked in the ARC, specified
|
|
in ms. These blocks are meant to be prefetched fairly aggressively ahead of
|
|
the code that may use them. A value of \fB0\fR will default to 6000 ms.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_missing_tvds\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Number of missing top-level vdevs which will be allowed during
|
|
pool import (only in read-only mode).
|
|
.sp
|
|
Default value: \fB0\fR
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_nvlist_src_size\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum size in bytes allowed to be passed as zc_nvlist_src_size for ioctls on
|
|
/dev/zfs. This prevents a user from causing the kernel to allocate an excessive
|
|
amount of memory. When the limit is exceeded, the ioctl fails with EINVAL and a
|
|
description of the error is sent to the zfs-dbgmsg log. This parameter should
|
|
not need to be touched under normal circumstances. On FreeBSD, the default is
|
|
based on the system limit on user wired memory. On Linux, the default is
|
|
\fB128MB\fR.
|
|
.sp
|
|
Default value: \fB0\fR (kernel decides)
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_multilist_num_sublists\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
To allow more fine-grained locking, each ARC state contains a series
|
|
of lists for both data and meta data objects. Locking is performed at
|
|
the level of these "sub-lists". This parameter controls the number of
|
|
sub-lists per ARC state, and also applies to other uses of the
|
|
multilist data structure.
|
|
.sp
|
|
Default value: \fB4\fR or the number of online CPUs, whichever is greater
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_overflow_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The ARC size is considered to be overflowing if it exceeds the current
|
|
ARC target size (arc_c) by a threshold determined by this parameter.
|
|
The threshold is calculated as a fraction of arc_c using the formula
|
|
"arc_c >> \fBzfs_arc_overflow_shift\fR".
|
|
|
|
The default value of 8 causes the ARC to be considered to be overflowing
|
|
if it exceeds the target size by 1/256th (about 0.4%) of the target size.
|
|
|
|
When the ARC is overflowing, new buffer allocations are stalled until
|
|
the reclaim thread catches up and the overflow condition no longer exists.
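.sp
As a rough illustration of the formula above, the following Python sketch
computes the overflow threshold for a given target size. The function name and
the 4 GiB target size are assumptions chosen for the example; they are not part
of ZFS.
.sp
.nf
```python
# Illustrative only: arc_c is an example value, not read from a live system.
def arc_overflow_threshold(arc_c: int, overflow_shift: int = 8) -> int:
    """Return how many bytes the ARC may exceed arc_c by before overflowing."""
    return arc_c >> overflow_shift

arc_c = 4 * 1024**3  # assume a 4 GiB ARC target size
threshold = arc_overflow_threshold(arc_c)
print(threshold)  # 16777216 bytes, i.e. 16 MiB (1/256th of 4 GiB)
```
.fi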
|
|
.sp
|
|
Default value: \fB8\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
|
|
\fBzfs_arc_p_min_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If set to a non-zero value, this will update arc_p_min_shift (default 4)
|
|
with the new value.
|
|
arc_p_min_shift is used as a shift of arc_c for calculating both the min and
max arc_p.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_p_dampener_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disable arc_p adapt dampener
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_shrink_shift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If set to a non-zero value, this will update arc_shrink_shift (default 7)
|
|
with the new value.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_pc_percent\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Percent of pagecache to reclaim the ARC to.
|
|
|
|
This tunable allows the ZFS ARC to play more nicely with the kernel's LRU
|
|
pagecache. It can guarantee that the ARC size won't collapse under scanning
|
|
pressure on the pagecache, yet still allows the ARC to be reclaimed down to
|
|
zfs_arc_min if necessary. This value is specified as percent of pagecache
|
|
size (as measured by NR_FILE_PAGES) where that percent may exceed 100. This
|
|
only operates during memory pressure/reclaim.
|
|
.sp
|
|
Default value: \fB0\fR% (disabled).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_shrinker_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This is a limit on how many pages the ARC shrinker makes available for
|
|
eviction in response to one page allocation attempt. Note that in
|
|
practice, the kernel's shrinker can ask us to evict up to about 4x this
|
|
for one allocation attempt.
|
|
.sp
|
|
The default limit of 10,000 (in practice, 160MB per allocation attempt with
|
|
4K pages) limits the amount of time spent attempting to reclaim ARC memory to
|
|
less than 100ms per allocation attempt, even with a small average compressed
|
|
block size of ~8KB.
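.sp
The 160MB figure follows from the page size and the roughly 4x factor
mentioned above. A minimal Python sketch of that arithmetic, assuming 4K pages
(the constant and function names are hypothetical):
.sp
.nf
```python
PAGE_SIZE = 4096         # assumed 4K pages
SHRINKER_MULTIPLIER = 4  # "up to about 4x", per the text above

def max_reclaim_bytes(shrinker_limit: int) -> int:
    """Approximate upper bound on bytes evicted for one allocation attempt."""
    return shrinker_limit * PAGE_SIZE * SHRINKER_MULTIPLIER

print(max_reclaim_bytes(10_000))  # 163840000 bytes, roughly 160MB
```
.fi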
|
|
.sp
|
|
The parameter can be set to 0 (zero) to disable the limit.
|
|
.sp
|
|
This parameter only applies on Linux.
|
|
.sp
|
|
Default value: \fB10,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_arc_sys_free\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The target number of bytes the ARC should leave as free memory on the system.
|
|
Defaults to the larger of 1/64 of physical memory or 512K. Setting this
|
|
option to a non-zero value will override the default.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_autoimport_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR).
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_checksum_events_per_second\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Rate limit checksum events to this many per second. Note that this should
|
|
not be set below the zed thresholds (currently 10 checksums over 10 sec)
|
|
or else zed may not trigger any action.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_commit_timeout_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This controls the amount of time that a ZIL block (lwb) will remain "open"
|
|
when it isn't "full", and it has a thread waiting for it to be committed to
|
|
stable storage. The timeout is scaled based on a percentage of the last lwb
|
|
latency to avoid significantly impacting the latency of each individual
|
|
transaction record (itx).
|
|
.sp
|
|
Default value: \fB5\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_condense_indirect_commit_entry_delay_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Vdev indirection layer (used for device removal) sleeps for this many
|
|
milliseconds during mapping generation. Intended for use with the test suite
|
|
to throttle vdev removal speed.
|
|
.sp
|
|
Default value: \fB0\fR (no throttle).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_condense_indirect_vdevs_enable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable condensing indirect vdev mappings. When set to a non-zero value,
|
|
attempt to condense indirect vdev mappings if the mapping uses more than
|
|
\fBzfs_condense_min_mapping_bytes\fR bytes of memory and if the obsolete
|
|
space map object uses more than \fBzfs_condense_max_obsolete_bytes\fR
|
|
bytes on-disk. The condensing process is an attempt to save memory by
|
|
removing obsolete mappings.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_condense_max_obsolete_bytes\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Only attempt to condense indirect vdev mappings if the on-disk size
|
|
of the obsolete space map object is greater than this number of bytes
|
|
(see \fBzfs_condense_indirect_vdevs_enable\fR).
|
|
.sp
|
|
Default value: \fB1,073,741,824\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_condense_min_mapping_bytes\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Minimum size vdev mapping to attempt to condense (see
|
|
\fBzfs_condense_indirect_vdevs_enable\fR).
|
|
.sp
|
|
Default value: \fB131,072\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dbgmsg_enable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Internally ZFS keeps a small log to facilitate debugging. By default the log
|
|
is disabled; to enable it, set this option to 1. The contents of the log can
|
|
be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to
|
|
this proc file clears the log.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dbgmsg_maxsize\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum size in bytes of the internal ZFS debug log.
|
|
.sp
|
|
Default value: \fB4M\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dbuf_state_index\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This feature is currently unused. It is normally used for controlling what
|
|
reporting is available under /proc/spl/kstat/zfs.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_deadman_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When a pool sync operation takes longer than \fBzfs_deadman_synctime_ms\fR
|
|
milliseconds, or when an individual I/O takes longer than
|
|
\fBzfs_deadman_ziotime_ms\fR milliseconds, then the operation is considered to
|
|
be "hung". If \fBzfs_deadman_enabled\fR is set then the deadman behavior is
|
|
invoked as described by the \fBzfs_deadman_failmode\fR module option.
|
|
By default the deadman is enabled and configured to \fBwait\fR which results
|
|
in "hung" I/Os only being logged. The deadman is automatically disabled
|
|
when a pool gets suspended.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_deadman_failmode\fR (charp)
|
|
.ad
|
|
.RS 12n
|
|
Controls the failure behavior when the deadman detects a "hung" I/O. Valid
|
|
values are \fBwait\fR, \fBcontinue\fR, and \fBpanic\fR.
|
|
.sp
|
|
\fBwait\fR - Wait for a "hung" I/O to complete. For each "hung" I/O a
|
|
"deadman" event will be posted describing that I/O.
|
|
.sp
|
|
\fBcontinue\fR - Attempt to recover from a "hung" I/O by re-dispatching it
|
|
to the I/O pipeline if possible.
|
|
.sp
|
|
\fBpanic\fR - Panic the system. This can be used to facilitate an automatic
|
|
fail-over to a properly configured fail-over partner.
|
|
.sp
|
|
Default value: \fBwait\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_deadman_checktime_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Check time in milliseconds. This defines the frequency at which we check
|
|
for hung I/O and potentially invoke the \fBzfs_deadman_failmode\fR behavior.
|
|
.sp
|
|
Default value: \fB60,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_deadman_synctime_ms\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Interval in milliseconds after which the deadman is triggered and also
|
|
the interval after which a pool sync operation is considered to be "hung".
|
|
Once this limit is exceeded the deadman will be invoked every
|
|
\fBzfs_deadman_checktime_ms\fR milliseconds until the pool sync completes.
|
|
.sp
|
|
Default value: \fB600,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_deadman_ziotime_ms\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Interval in milliseconds after which the deadman is triggered and an
|
|
individual I/O operation is considered to be "hung". As long as the I/O
|
|
remains "hung" the deadman will be invoked every \fBzfs_deadman_checktime_ms\fR
|
|
milliseconds until the I/O completes.
|
|
.sp
|
|
Default value: \fB300,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dedup_prefetch\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable prefetching of deduplicated blocks.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR to disable (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_delay_min_dirty_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Start to delay each transaction once there is this amount of dirty data,
|
|
expressed as a percentage of \fBzfs_dirty_data_max\fR.
|
|
This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
|
|
See the section "ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Default value: \fB60\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_delay_scale\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This controls how quickly the transaction delay approaches infinity.
|
|
Larger values cause longer delays for a given amount of dirty data.
|
|
.sp
|
|
For the smoothest delay, this value should be about 1 billion divided
|
|
by the maximum number of operations per second. This will smoothly
|
|
handle between 10x and 1/10th this number.
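.sp
The rule of thumb above can be sketched in Python as follows. The function
names are hypothetical, and the 2,000 operations per second figure is only an
example input that happens to reproduce the default value.
.sp
.nf
```python
def suggested_delay_scale(max_ops_per_second: int) -> int:
    """About 1 billion divided by the maximum operations per second."""
    return 1_000_000_000 // max_ops_per_second

def fits_in_u64(delay_scale: int, dirty_data_max: int) -> bool:
    """Check the constraint zfs_delay_scale * zfs_dirty_data_max < 2^64."""
    return delay_scale * dirty_data_max < 2**64

scale = suggested_delay_scale(2_000)
print(scale)                            # 500000
print(fits_in_u64(scale, 4 * 1024**3))  # True for a 4 GiB dirty data limit
```
.fi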
|
|
.sp
|
|
See the section "ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64.
|
|
.sp
|
|
Default value: \fB500,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_disable_ivset_guid_check\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disables requirement for IVset guids to be present and match when doing a raw
|
|
receive of encrypted datasets. Intended for users whose pools were created with
|
|
OpenZFS pre-release versions and now have compatibility issues.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_key_max_salt_uses\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of uses of a single salt value before generating a new one for
|
|
encrypted datasets. The default value is also the maximum that will be
|
|
accepted.
|
|
.sp
|
|
Default value: \fB400,000,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_object_mutex_size\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Size of the znode hashtable used for holds.
|
|
|
|
Due to the need to hold locks on objects that may not exist yet, kernel mutexes
|
|
are not created per-object and instead a hashtable is used where collisions
|
|
will result in objects waiting when there is not actually contention on the
|
|
same object.
|
|
.sp
|
|
Default value: \fB64\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_slow_io_events_per_second\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Rate limit delay and deadman zevents (which report slow I/Os) to this many per
|
|
second.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unflushed_max_mem_amt\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Upper-bound limit for unflushed metadata changes to be held by the
|
|
log spacemap in memory (in bytes).
|
|
.sp
|
|
Default value: \fB1,073,741,824\fR (1GB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unflushed_max_mem_ppm\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Percentage of the overall system memory that ZFS allows to be used
|
|
for unflushed metadata changes by the log spacemap.
|
|
(value is calculated over 1000000 for finer granularity).
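.sp
To make the parts-per-million scaling concrete, here is a small Python sketch.
The function name and the 16 GiB memory size are assumptions for illustration;
the sketch covers only this parameter and ignores any other limit.
.sp
.nf
```python
def unflushed_mem_limit(total_memory_bytes: int, ppm: int = 1000) -> int:
    """Bytes allowed for unflushed changes, given a parts-per-million value."""
    return total_memory_bytes * ppm // 1_000_000

# 1000 ppm is 0.1% of memory; assume a machine with 16 GiB of RAM.
print(unflushed_mem_limit(16 * 1024**3))  # 17179869 bytes, about 16 MiB
```
.fi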
|
|
.sp
|
|
Default value: \fB1000\fR (which is divided by 1000000, resulting in
|
|
the limit to be \fB0.1\fR% of memory)
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unflushed_log_block_max\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Describes the maximum number of log spacemap blocks allowed for each pool.
|
|
The default value of 262144 means that the space in all the log spacemaps
|
|
can add up to no more than 262144 blocks (which means 32GB of logical
|
|
space before compression and ditto blocks, assuming that blocksize is
|
|
128k).
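.sp
The 32GB figure can be checked with a short Python calculation. The function
name is hypothetical and the 128k block size is the assumption stated above.
.sp
.nf
```python
def log_space_bytes(block_max: int = 262144, block_size: int = 128 * 1024) -> int:
    """Logical space covered by block_max log spacemap blocks."""
    return block_max * block_size

print(log_space_bytes() // 1024**3)  # 32 (GiB)
```
.fi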
|
|
.sp
|
|
This tunable is important because it involves a trade-off between import
|
|
time after an unclean export and the frequency of flushing metaslabs.
|
|
The higher this number is, the more log blocks we allow when the pool is
|
|
active which means that we flush metaslabs less often and thus decrease
|
|
the number of I/Os for spacemap updates per TXG.
|
|
At the same time though, that means that in the event of an unclean export,
|
|
there will be more log spacemap blocks for us to read, inducing overhead
|
|
in the import time of the pool.
|
|
The lower the number, the more flushing occurs, destroying log
blocks more quickly as they become obsolete faster, which leaves fewer blocks
to be read during import time after a crash.
|
|
.sp
|
|
Each log spacemap block existing during pool import leads to approximately
|
|
one extra logical I/O issued.
|
|
This is the reason why this tunable is exposed in terms of blocks rather
|
|
than space used.
|
|
.sp
|
|
Default value: \fB262144\fR (256K).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unflushed_log_block_min\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
If the number of metaslabs is small and our incoming rate is high, we
|
|
could get into a situation that we are flushing all our metaslabs every
|
|
TXG.
|
|
Thus we always allow at least this many log blocks.
|
|
.sp
|
|
Default value: \fB1000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unflushed_log_block_pct\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Tunable used to determine the number of blocks that can be used for
|
|
the spacemap log, expressed as a percentage of the total number of
|
|
metaslabs in the pool.
|
|
.sp
|
|
Default value: \fB400\fR (read as \fB400\fR% - meaning that the number
|
|
of log spacemap blocks is capped at 4 times the number of
|
|
metaslabs in the pool).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_unlink_suspend_progress\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
When enabled, files will not be asynchronously removed from the list of pending
|
|
unlinks and the space they consume will be leaked. Once this option has been
|
|
disabled and the dataset is remounted, the pending unlinks will be processed
|
|
and the freed space returned to the pool.
|
|
This option is used by the test suite to facilitate testing.
|
|
.sp
|
|
Use \fB0\fR (default) to allow progress and \fB1\fR to pause progress.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_delete_blocks\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
This is used to define a large file for the purposes of delete. Files
|
|
containing more than \fBzfs_delete_blocks\fR blocks will be deleted asynchronously
|
|
while smaller files are deleted synchronously. Decreasing this value will
|
|
reduce the time spent in an unlink(2) system call at the expense of a longer
|
|
delay before the freed space is available.
|
|
.sp
|
|
Default value: \fB20,480\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dirty_data_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Determines the dirty space limit in bytes. Once this limit is exceeded, new
|
|
writes are halted until space frees up. This parameter takes precedence
|
|
over \fBzfs_dirty_data_max_percent\fR.
|
|
See the section "ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Default value: \fB10\fR% of physical RAM, capped at \fBzfs_dirty_data_max_max\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dirty_data_max_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes.
|
|
This limit is only enforced at module load time, and will be ignored if
|
|
\fBzfs_dirty_data_max\fR is later changed. This parameter takes
|
|
precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section
|
|
"ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Default value: \fB25\fR% of physical RAM.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dirty_data_max_max_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a
|
|
percentage of physical RAM. This limit is only enforced at module load
|
|
time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed.
|
|
The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this
|
|
one. See the section "ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Default value: \fB25\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dirty_data_max_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Determines the dirty space limit, expressed as a percentage of all
|
|
memory. Once this limit is exceeded, new writes are halted until space frees
|
|
up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this
|
|
one. See the section "ZFS TRANSACTION DELAY".
|
|
.sp
|
|
Default value: \fB10\fR%, subject to \fBzfs_dirty_data_max_max\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dirty_data_sync_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Start syncing out a transaction group if there's at least this much dirty data
|
|
as a percentage of \fBzfs_dirty_data_max\fR. This should be less than
|
|
\fBzfs_vdev_async_write_active_min_dirty_percent\fR.
|
|
.sp
|
|
Default value: \fB20\fR% of \fBzfs_dirty_data_max\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_fallocate_reserve_percent\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Since ZFS is a copy-on-write filesystem with snapshots, blocks cannot be
|
|
preallocated for a file in order to guarantee that later writes will not
|
|
run out of space. Instead, fallocate() space preallocation only checks
|
|
that sufficient space is currently available in the pool or the user's
|
|
project quota allocation, and then creates a sparse file of the requested
|
|
size. The requested space is multiplied by \fBzfs_fallocate_reserve_percent\fR
|
|
to allow additional space for indirect blocks and other internal metadata.
|
|
Setting this value to 0 disables support for fallocate(2) and returns
|
|
EOPNOTSUPP for fallocate() space preallocation again.
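.sp
A minimal Python sketch of the multiplication described above. The function
name is hypothetical and the integer rounding is an assumption; the kernel
code may differ in detail.
.sp
.nf
```python
def fallocate_space_check(requested_bytes: int, reserve_percent: int = 110) -> int:
    """Space that must be available for a preallocation of requested_bytes."""
    return requested_bytes * reserve_percent // 100

# A 1 GiB request with the default 110% reserve.
print(fallocate_space_check(1024**3))  # 1181116006 bytes
```
.fi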
|
|
.sp
|
|
Default value: \fB110\fR%
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_fletcher_4_impl\fR (string)
|
|
.ad
|
|
.RS 12n
|
|
Select a fletcher 4 implementation.
|
|
.sp
|
|
Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR,
|
|
\fBavx2\fR, \fBavx512f\fR, \fBavx512bw\fR, and \fBaarch64_neon\fR.
|
|
All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction
|
|
set extensions to be available and will only appear if ZFS detects that they are
|
|
present at runtime. If multiple implementations of fletcher 4 are available,
|
|
the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR
|
|
results in the original CPU-based calculation being used. Selecting any option
|
|
other than \fBfastest\fR and \fBscalar\fR results in vector instructions from
|
|
the respective CPU instruction set being used.
|
|
.sp
|
|
Default value: \fBfastest\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_free_bpobj_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable/disable the processing of the free_bpobj object.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_async_block_max_blocks\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of blocks freed in a single txg.
|
|
.sp
|
|
Default value: \fBULONG_MAX\fR (unlimited).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_async_dedup_frees\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of dedup blocks freed in a single txg.
|
|
.sp
|
|
Default value: \fB100,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_override_estimate_recordsize\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Record size calculation override for zfs send estimates.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_read_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum asynchronous read I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB3\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_read_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum asynchronous read I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_write_active_max_dirty_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When the pool has more than
|
|
\fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use
|
|
\fBzfs_vdev_async_write_max_active\fR to limit active async writes. If
|
|
the dirty data is between min and max, the active I/O limit is linearly
|
|
interpolated. See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB60\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_write_active_min_dirty_percent\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When the pool has less than
|
|
\fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use
|
|
\fBzfs_vdev_async_write_min_active\fR to limit active async writes. If
|
|
the dirty data is between min and max, the active I/O limit is linearly
|
|
interpolated. See the section "ZFS I/O SCHEDULER".
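.sp
The linear interpolation between the min and max limits can be sketched in
Python as follows. The function name and the integer rounding are assumptions;
the default arguments are the documented defaults of the four
\fBzfs_vdev_async_write_*\fR parameters.
.sp
.nf
```python
def async_write_limit(dirty_pct: int,
                      min_active: int = 2, max_active: int = 10,
                      min_dirty_pct: int = 30, max_dirty_pct: int = 60) -> int:
    """Active async write limit for a given percentage of dirty data."""
    if dirty_pct <= min_dirty_pct:
        return min_active
    if dirty_pct >= max_dirty_pct:
        return max_active
    span = max_dirty_pct - min_dirty_pct
    return min_active + (dirty_pct - min_dirty_pct) * (max_active - min_active) // span

print(async_write_limit(20))  # 2, below the min threshold
print(async_write_limit(45))  # 6, halfway between min and max
print(async_write_limit(80))  # 10, above the max threshold
```
.fi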
|
|
.sp
|
|
Default value: \fB30\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_write_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum asynchronous write I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_async_write_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum asynchronous write I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Lower values are associated with better latency on rotational media but poorer
|
|
resilver performance. The default value of 2 was chosen as a compromise. A
|
|
value of 3 has been shown to improve resilver performance further at a cost of
|
|
further increasing latency.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_initializing_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum initializing I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_initializing_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum initializing I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum number of I/Os active to each device. Ideally, this will be >=
|
|
the sum of each queue's max_active. See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_rebuild_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum sequential resilver I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB3\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_rebuild_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum sequential resilver I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_removal_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum removal I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_removal_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum removal I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_scrub_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum scrub I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_scrub_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum scrub I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_sync_read_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum synchronous read I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_sync_read_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum synchronous read I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_sync_write_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum synchronous write I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_sync_write_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum synchronous write I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_trim_max_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum trim/discard I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_trim_min_active\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Minimum trim/discard I/Os active to each device.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_nia_delay\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
|
|
the number of concurrently-active I/Os is limited to *_min_active, unless
|
|
the vdev is "idle". When there are no interactive I/Os active (sync or
|
|
async), and zfs_vdev_nia_delay I/Os have completed since the last
|
|
interactive I/O, then the vdev is considered to be "idle", and the number
|
|
of concurrently-active non-interactive I/Os is increased to *_max_active.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_nia_credit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Some HDDs tend to prioritize sequential I/O so highly that concurrent
|
|
random I/O latency reaches several seconds. On some HDDs it happens
|
|
even if sequential I/Os are submitted one at a time, and so setting
|
|
*_max_active to 1 does not help. To prevent non-interactive I/Os, like
|
|
scrub, from monopolizing the device, no more than zfs_vdev_nia_credit
|
|
I/Os can be sent while there are outstanding incomplete interactive
|
|
I/Os. This enforced wait ensures the HDD services the interactive I/O
|
|
within a reasonable amount of time.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_queue_depth_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of queued allocations per top-level vdev expressed as
|
|
a percentage of \fBzfs_vdev_async_write_max_active\fR which allows the
|
|
system to detect devices that are more capable of handling allocations
|
|
and to allocate more blocks to those devices. It allows for dynamic
|
|
allocation distribution when devices are imbalanced as fuller devices
|
|
will tend to be slower than empty devices.
|
|
|
|
See also \fBzio_dva_throttle_enabled\fR.
|
|
.sp
|
|
Default value: \fB1000\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_expire_snapshot\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Seconds to expire .zfs/snapshot.
|
|
.sp
|
|
Default value: \fB300\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_admin_snapshot\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Allow the creation, removal, or renaming of entries in the .zfs/snapshot
|
|
directory to cause the creation, destruction, or renaming of snapshots.
|
|
When enabled this functionality works both locally and over NFS exports
|
|
which have the 'no_root_squash' option set. This functionality is disabled
|
|
by default.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_flags\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set additional debugging flags. The following flags may be bitwise-or'd
|
|
together.
|
|
.sp
|
|
.TS
|
|
box;
|
|
rB lB
|
|
lB lB
|
|
r l.
|
|
Value Symbolic Name
|
|
Description
|
|
_
|
|
1 ZFS_DEBUG_DPRINTF
|
|
Enable dprintf entries in the debug log.
|
|
_
|
|
2 ZFS_DEBUG_DBUF_VERIFY *
|
|
Enable extra dbuf verifications.
|
|
_
|
|
4 ZFS_DEBUG_DNODE_VERIFY *
|
|
Enable extra dnode verifications.
|
|
_
|
|
8 ZFS_DEBUG_SNAPNAMES
|
|
Enable snapshot name verification.
|
|
_
|
|
16 ZFS_DEBUG_MODIFY
|
|
Check for illegally modified ARC buffers.
|
|
_
|
|
64 ZFS_DEBUG_ZIO_FREE
|
|
Enable verification of block frees.
|
|
_
|
|
128 ZFS_DEBUG_HISTOGRAM_VERIFY
|
|
Enable extra spacemap histogram verifications.
|
|
_
|
|
256 ZFS_DEBUG_METASLAB_VERIFY
|
|
Verify space accounting on disk matches in-core range_trees.
|
|
_
|
|
512 ZFS_DEBUG_SET_ERROR
|
|
Enable SET_ERROR and dprintf entries in the debug log.
|
|
_
|
|
1024 ZFS_DEBUG_INDIRECT_REMAP
|
|
Verify split blocks created by device removal.
|
|
_
|
|
2048 ZFS_DEBUG_TRIM
|
|
Verify TRIM ranges are always within the allocatable range tree.
|
|
_
|
|
4096 ZFS_DEBUG_LOG_SPACEMAP
|
|
Verify that the log summary is consistent with the spacemap log
|
|
and enable zfs_dbgmsgs for metaslab loading and flushing.
|
|
.TE
|
|
.sp
|
|
* Requires debug build.
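.sp
As an example of combining flags, the following Python sketch ORs three of the
values from the table above. The choice of flags is arbitrary and only
illustrates the arithmetic.
.sp
.nf
```python
# Values copied from the table above.
ZFS_DEBUG_DPRINTF = 1
ZFS_DEBUG_SET_ERROR = 512
ZFS_DEBUG_LOG_SPACEMAP = 4096

flags = ZFS_DEBUG_DPRINTF | ZFS_DEBUG_SET_ERROR | ZFS_DEBUG_LOG_SPACEMAP
print(flags)  # 4609, the value to assign to zfs_flags
```
.fi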
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_free_leak_on_eio\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If destroy encounters an EIO while reading metadata (e.g. indirect
|
|
blocks), space referenced by the missing metadata can not be freed.
|
|
Normally this causes the background destroy to become "stalled", as
|
|
it is unable to make forward progress. While in this stalled state,
|
|
all remaining space to free from the error-encountering filesystem is
|
|
"temporarily leaked". Set this flag to cause it to ignore the EIO,
|
|
permanently leak the space from indirect blocks that can not be read,
|
|
and continue to free everything else that it can.
|
|
|
|
The default, "stalling" behavior is useful if the storage partially
|
|
fails (i.e. some but not all i/os fail), and then later recovers. In
|
|
this case, we will be able to continue pool operations while it is
|
|
partially failed, and when it recovers, we can continue to free the
|
|
space, with no leaks. However, note that this case is actually
|
|
fairly rare.
|
|
|
|
Typically pools either (a) fail completely (but perhaps temporarily,
|
|
e.g. a top-level vdev going offline), or (b) have localized,
|
|
permanent errors (e.g. disk returns the wrong data due to bit flip or
|
|
firmware bug). In case (a), this setting does not matter because the
|
|
pool will be suspended and the sync thread will not be able to make
|
|
forward progress regardless. In case (b), because the error is
|
|
permanent, the best we can do is leak the minimum amount of space,
|
|
which is what setting this flag will do. Therefore, it is reasonable
|
|
for this flag to normally be set, but we chose the more conservative
|
|
approach of not setting it, so that there is no possibility of
|
|
leaking space in the "partial temporary" failure case.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_free_min_time_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
During a \fBzfs destroy\fR operation using \fBfeature@async_destroy\fR a minimum
|
|
of this much time will be spent working on freeing blocks per txg.
|
|
.sp
|
|
Default value: \fB1,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_obsolete_min_time_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Similar to \fBzfs_free_min_time_ms\fR but for cleanup of old indirection records
|
|
for removed vdevs.
|
|
.sp
|
|
Default value: \fB500\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_immediate_write_sz\fR (long)
|
|
.ad
|
|
.RS 12n
|
|
Largest data block to write to zil. Larger blocks will be treated as if the
|
|
dataset being written to had the property setting \fBlogbias=throughput\fR.
|
|
.sp
|
|
Default value: \fB32,768\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_initialize_value\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Pattern written to vdev free space by \fBzpool initialize\fR.
|
|
.sp
|
|
Default value: \fB16,045,690,984,833,335,022\fR (0xdeadbeefdeadbeee).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_initialize_chunk_size\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Size of writes used by \fBzpool initialize\fR.
|
|
This option is used by the test suite to facilitate testing.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_max_entries\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The threshold size (in block pointers) at which we create a new sub-livelist.
|
|
Larger sublists are more costly from a memory perspective but the fewer
|
|
sublists there are, the lower the cost of insertion.
|
|
.sp
|
|
Default value: \fB500,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_min_percent_shared\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If the amount of shared space between a snapshot and its clone drops below
|
|
this threshold, the clone turns off the livelist and reverts to the old deletion
|
|
method. This is in place because once a clone has been overwritten enough
|
|
livelists no longer give us a benefit.
|
|
.sp
|
|
Default value: \fB75\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_condense_new_alloc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Incremented each time an extra ALLOC blkptr is added to a livelist entry while
|
|
it is being condensed.
|
|
This option is used by the test suite to track race conditions.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_condense_sync_cancel\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Incremented each time livelist condensing is canceled while in
|
|
spa_livelist_condense_sync.
|
|
This option is used by the test suite to track race conditions.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_condense_sync_pause\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When set, the livelist condense process pauses indefinitely before
|
|
executing the synctask - spa_livelist_condense_sync.
|
|
This option is used by the test suite to trigger race conditions.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_condense_zthr_cancel\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Incremented each time livelist condensing is canceled while in
|
|
spa_livelist_condense_cb.
|
|
This option is used by the test suite to track race conditions.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_livelist_condense_zthr_pause\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When set, the livelist condense process pauses indefinitely before
|
|
executing the open context condensing work in spa_livelist_condense_cb.
|
|
This option is used by the test suite to trigger race conditions.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_lua_max_instrlimit\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The maximum execution time limit that can be set for a ZFS channel program,
|
|
specified as a number of Lua instructions.
|
|
.sp
|
|
Default value: \fB100,000,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_lua_max_memlimit\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The maximum memory limit that can be set for a ZFS channel program, specified
|
|
in bytes.
|
|
.sp
|
|
Default value: \fB104,857,600\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_dataset_nesting\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum depth of nested datasets. This value can be tuned temporarily to
|
|
fix existing datasets that exceed the predefined limit.
|
|
.sp
|
|
Default value: \fB50\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_log_walking\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
The number of past TXGs that the flushing algorithm of the log spacemap
|
|
feature uses to estimate incoming log blocks.
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_logsm_summary_length\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of rows allowed in the summary of the spacemap log.
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_max_recordsize\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
We currently support block sizes from 512 bytes to 16MB. The benefits of
|
|
larger blocks, and thus larger I/O, need to be weighed against the cost of
|
|
COWing a giant block to modify one byte. Additionally, very large blocks
|
|
can have an impact on i/o latency, and also potentially on the memory
|
|
allocator. Therefore, we do not allow the recordsize to be set larger than
|
|
zfs_max_recordsize (default 1MB). Larger blocks can be created by changing
|
|
this tunable, and pools with larger blocks can always be imported and used,
|
|
regardless of this setting.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_allow_redacted_dataset_mount\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Allow datasets received with redacted send/receive to be mounted. Normally
|
|
disabled because these datasets may be missing key data.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_min_metaslabs_to_flush\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Minimum number of metaslabs to flush per dirty TXG.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_metaslab_fragmentation_threshold\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Allow metaslabs to keep their active state as long as their fragmentation
|
|
percentage is less than or equal to this value. An active metaslab that
|
|
exceeds this threshold will no longer keep its active status allowing
|
|
better metaslabs to be selected.
|
|
.sp
|
|
Default value: \fB70\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_mg_fragmentation_threshold\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Metaslab groups are considered eligible for allocations if their
|
|
fragmentation metric (measured as a percentage) is less than or equal to
|
|
this value. If a metaslab group exceeds this threshold then it will be
|
|
skipped unless all metaslab groups within the metaslab class have also
|
|
crossed this threshold.
|
|
.sp
|
|
Default value: \fB95\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_mg_noalloc_threshold\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Defines a threshold at which metaslab groups should be eligible for
|
|
allocations. The value is expressed as a percentage of free space
|
|
beyond which a metaslab group is always eligible for allocations.
|
|
If a metaslab group's free space is less than or equal to the
|
|
threshold, the allocator will avoid allocating to that group
|
|
unless all groups in the pool have reached the threshold. Once all
|
|
groups have reached the threshold, all groups are allowed to accept
|
|
allocations. The default value of 0 disables the feature and causes
|
|
all metaslab groups to be eligible for allocations.
|
|
|
|
This parameter allows one to deal with pools having heavily imbalanced
|
|
vdevs such as would be the case when a new vdev has been added.
|
|
Setting the threshold to a non-zero percentage will stop allocations
|
|
from being made to vdevs that aren't filled to the specified percentage
|
|
and allow lesser filled vdevs to acquire more allocations than they
|
|
otherwise would under the old \fBzfs_mg_alloc_failures\fR facility.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
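.sp
.LP
The eligibility rule described for \fBzfs_mg_noalloc_threshold\fR can be
sketched as follows. This is a simplified model of the text above and not the
allocator's implementation: the function name \fBeligible_groups\fR and the
dictionary of per-group free-space percentages are assumptions made for
illustration.
.sp
.nf
```python
def eligible_groups(free_pct_by_group, noalloc_threshold=0):
    """Return the metaslab groups that may receive allocations.

    free_pct_by_group maps a (hypothetical) group name to its free
    space as a percentage. A threshold of 0 disables the feature.
    """
    if noalloc_threshold == 0:
        return list(free_pct_by_group)

    # Groups whose free space is above the threshold stay eligible.
    above = [name for name, free in free_pct_by_group.items()
             if free > noalloc_threshold]
    if above:
        return above

    # Every group has reached the threshold, so all groups
    # are allowed to accept allocations again.
    return list(free_pct_by_group)


if __name__ == "__main__":
    print(eligible_groups({"old-vdev": 5, "new-vdev": 80}, 10))
```
.fi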
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_ddt_data_is_special\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If enabled, ZFS will place DDT data into the special allocation class.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_user_indirect_is_special\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If enabled, ZFS will place user data (both file and zvol) indirect blocks
|
|
into the special allocation class.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_multihost_history\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Historical statistics for the last N multihost updates will be available in
|
|
\fB/proc/spl/kstat/zfs/<pool>/multihost\fR.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_multihost_interval\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Used to control the frequency of multihost writes which are performed when the
|
|
\fBmultihost\fR pool property is on. This is one factor used to determine the
|
|
length of the activity check during import.
|
|
.sp
|
|
The multihost write period is \fBzfs_multihost_interval / leaf-vdevs\fR
|
|
milliseconds. On average a multihost write will be issued for each leaf vdev
|
|
every \fBzfs_multihost_interval\fR milliseconds. In practice, the observed
|
|
period can vary with the I/O load and this observed value is the delay which is
|
|
stored in the uberblock.
|
|
.sp
|
|
Default value: \fB1000\fR.
|
|
.RE
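.sp
.LP
The nominal write period described for \fBzfs_multihost_interval\fR is a
simple division, shown here as a small Python sketch. The function name is
hypothetical, and the result is only the nominal period; as noted above, the
observed period can vary with I/O load.
.sp
.nf
```python
def multihost_write_period_ms(zfs_multihost_interval, leaf_vdevs):
    """Nominal time between multihost writes, in milliseconds."""
    if leaf_vdevs < 1:
        raise ValueError("a pool has at least one leaf vdev")
    return zfs_multihost_interval / leaf_vdevs


if __name__ == "__main__":
    # With the default interval of 1000 ms and four leaf vdevs, one
    # multihost write is issued somewhere in the pool every 250 ms.
    print(multihost_write_period_ms(1000, 4))
```
.fi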
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_multihost_import_intervals\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Used to control the duration of the activity test on import. Smaller values of
|
|
\fBzfs_multihost_import_intervals\fR will reduce the import time but increase
|
|
the risk of failing to detect an active pool. The total activity check time is
|
|
never allowed to drop below one second.
|
|
.sp
|
|
On import the activity check waits a minimum amount of time determined by
|
|
\fBzfs_multihost_interval * zfs_multihost_import_intervals\fR, or the same
|
|
product computed on the host which last had the pool imported (whichever is
|
|
greater). The activity check time may be further extended if the value of mmp
|
|
delay found in the best uberblock indicates actual multihost updates happened
|
|
at longer intervals than \fBzfs_multihost_interval\fR. A minimum value of
|
|
\fB100ms\fR is enforced.
|
|
.sp
|
|
A value of 0 is ignored and treated as if it was set to 1.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
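.sp
.LP
The lower bound on the import activity check can be approximated as below.
This sketch models only the rules stated above (the product of the two
tunables, the larger product from the previous host, a value of 0 treated as
1, and the one second floor). It does not model the further extension based
on the mmp delay stored in the uberblock. The function and parameter names
are hypothetical.
.sp
.nf
```python
def min_activity_check_ms(zfs_multihost_interval,
                          zfs_multihost_import_intervals,
                          previous_host_product_ms=0):
    """Lower bound, in milliseconds, on the import activity check."""
    # A value of 0 is treated as if it were set to 1.
    intervals = max(zfs_multihost_import_intervals, 1)
    local_product = zfs_multihost_interval * intervals

    # Use the greater of the local product and the product computed on
    # the host that last imported the pool, and never less than 1 second.
    return max(local_product, previous_host_product_ms, 1000)


if __name__ == "__main__":
    print(min_activity_check_ms(1000, 20))  # defaults from this page
```
.fi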
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_multihost_fail_intervals\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Controls the behavior of the pool when multihost write failures or delays are
|
|
detected.
|
|
.sp
|
|
When \fBzfs_multihost_fail_intervals = 0\fR, multihost write failures or delays
|
|
are ignored. The failures will still be reported to the ZED, which, depending on
its configuration, may take action such as suspending the pool or offlining a
|
|
device.
|
|
|
|
.sp
|
|
When \fBzfs_multihost_fail_intervals > 0\fR, the pool will be suspended if
|
|
\fBzfs_multihost_fail_intervals * zfs_multihost_interval\fR milliseconds pass
|
|
without a successful mmp write. This guarantees the activity test will see
|
|
mmp writes if the pool is imported. A value of 1 is ignored and treated as
|
|
if it was set to 2. This is necessary to prevent the pool from being suspended
|
|
due to normal, small I/O latency variations.
|
|
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
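.sp
.LP
The suspension timeout described for \fBzfs_multihost_fail_intervals\fR can
be worked through with a short Python sketch. It follows only the rules
stated above; the function name is hypothetical, and returning \fBNone\fR to
mean "never suspend" is a choice made for this example.
.sp
.nf
```python
def mmp_suspend_timeout_ms(zfs_multihost_fail_intervals,
                           zfs_multihost_interval):
    """Milliseconds without a successful mmp write before suspension.

    Returns None when write failures or delays are ignored.
    """
    if zfs_multihost_fail_intervals == 0:
        return None

    # A value of 1 is treated as if it were set to 2.
    effective = max(zfs_multihost_fail_intervals, 2)
    return effective * zfs_multihost_interval


if __name__ == "__main__":
    print(mmp_suspend_timeout_ms(10, 1000))  # defaults from this page
```
.fi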
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_no_scrub_io\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set for no scrub I/O. This results in scrubs not actually scrubbing data and
|
|
simply doing a metadata crawl of the pool instead.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_no_scrub_prefetch\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set to disable block prefetching for scrubs.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_nocacheflush\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disable cache flush operations on disks when writing. Setting this will
|
|
cause pool corruption on power loss if a volatile out-of-order write cache
|
|
is enabled.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_nopwrite_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable NOP writes.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR to disable.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_dmu_offset_next_sync\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enable forcing a txg sync to find holes. When enabled, ZFS acts like prior
versions when the SEEK_HOLE or SEEK_DATA flags are used: if the dnode is
dirty, txgs are synced so that this data can be found.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR to disable (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_pd_bytes_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The number of bytes which should be prefetched during a pool traversal
|
|
(e.g. \fBzfs send\fR or other data crawling operations).
|
|
.sp
|
|
Default value: \fB52,428,800\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_per_txg_dirty_frees_percent\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Tunable to control percentage of dirtied indirect blocks from frees allowed
|
|
into one TXG. After this threshold is crossed, additional frees will wait until
|
|
the next TXG.
|
|
A value of zero will disable this throttle.
|
|
.sp
|
|
Default value: \fB5\fR, set to \fB0\fR to disable.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_prefetch_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This tunable disables predictive prefetch. Note that it leaves "prescient"
|
|
prefetch (e.g. prefetch for zfs send) intact. Unlike predictive prefetch,
|
|
prescient prefetch never issues i/os that end up not being needed, so it
|
|
can't hurt performance.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_qat_checksum_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This tunable disables qat hardware acceleration for sha256 checksums. It
|
|
may be set after the zfs modules have been loaded to initialize the qat
|
|
hardware as long as support is compiled in and the qat driver is present.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_qat_compress_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This tunable disables qat hardware acceleration for gzip compression. It
|
|
may be set after the zfs modules have been loaded to initialize the qat
|
|
hardware as long as support is compiled in and the qat driver is present.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_qat_encrypt_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This tunable disables qat hardware acceleration for AES-GCM encryption. It
|
|
may be set after the zfs modules have been loaded to initialize the qat
|
|
hardware as long as support is compiled in and the qat driver is present.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_read_chunk_size\fR (long)
|
|
.ad
|
|
.RS 12n
|
|
Bytes to read per chunk.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_read_history\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Historical statistics for the last N reads will be available in
|
|
\fB/proc/spl/kstat/zfs/<pool>/reads\fR.
|
|
.sp
|
|
Default value: \fB0\fR (no data is kept).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_read_history_hits\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Include cache hits in read history.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_rebuild_max_segment\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Maximum read segment size to issue when sequentially resilvering a
|
|
top-level vdev.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_reconstruct_indirect_combinations_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If an indirect split block contains more than this many possible unique
|
|
combinations when being reconstructed, consider it too computationally
|
|
expensive to check them all. Instead, try at most
|
|
\fBzfs_reconstruct_indirect_combinations_max\fR randomly-selected
|
|
combinations each time the block is accessed. This allows all segment
|
|
copies to participate fairly in the reconstruction when all combinations
|
|
cannot be checked and prevents repeated use of one bad copy.
|
|
.sp
|
|
Default value: \fB4096\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_recover\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Set to attempt to recover from fatal errors. This should only be used as a
|
|
last resort, as it typically results in leaked space, or worse.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_removal_ignore_errors\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
.sp
|
|
Ignore hard IO errors during device removal. When set, if a device encounters
|
|
a hard IO error during the removal process the removal will not be cancelled.
|
|
This can result in a normally recoverable block becoming permanently damaged
|
|
and is not recommended. This should only be used as a last resort when the
|
|
pool cannot be returned to a healthy state prior to removing the device.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_removal_suspend_progress\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
.sp
|
|
This is used by the test suite so that it can ensure that certain actions
|
|
happen while in the middle of a removal.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_remove_max_segment\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
.sp
|
|
The largest contiguous segment that we will attempt to allocate when removing
|
|
a device. This can be no larger than 16MB. If there is a performance
|
|
problem with attempting to allocate large blocks, consider decreasing this.
|
|
.sp
|
|
Default value: \fB16,777,216\fR (16MB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_resilver_disable_defer\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disables the \fBresilver_defer\fR feature, causing an operation that would
|
|
start a resilver to restart one in progress immediately.
|
|
.sp
|
|
Default value: \fB0\fR (feature enabled).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_resilver_min_time_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Resilvers are processed by the sync thread. While resilvering it will spend
|
|
at least this much time working on a resilver between txg flushes.
|
|
.sp
|
|
Default value: \fB3,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_ignore_errors\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If set to a nonzero value, remove the DTL (dirty time list) upon
|
|
completion of a pool scan (scrub) even if there were unrepairable
|
|
errors. It is intended to be used during pool repair or recovery to
|
|
stop resilvering when the pool is next imported.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scrub_min_time_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Scrubs are processed by the sync thread. While scrubbing it will spend
|
|
at least this much time working on a scrub between txg flushes.
|
|
.sp
|
|
Default value: \fB1,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_checkpoint_intval\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
To preserve progress across reboots the sequential scan algorithm periodically
|
|
needs to stop metadata scanning and issue all the verification I/Os to disk.
|
|
The frequency of this flushing is determined by the
|
|
\fBzfs_scan_checkpoint_intval\fR tunable.
|
|
.sp
|
|
Default value: \fB7200\fR seconds (every 2 hours).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_fill_weight\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This tunable affects how scrub and resilver I/O segments are ordered. A higher
|
|
number indicates that we care more about how filled in a segment is, while a
|
|
lower number indicates we care more about the size of the extent without
|
|
considering the gaps within a segment. This value is only tunable upon module
|
|
insertion. Changing the value afterwards will have no effect on scrub or
|
|
resilver performance.
|
|
.sp
|
|
Default value: \fB3\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_issue_strategy\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Determines the order that data will be verified while scrubbing or resilvering.
|
|
If set to \fB1\fR, data will be verified as sequentially as possible, given the
|
|
amount of memory reserved for scrubbing (see \fBzfs_scan_mem_lim_fact\fR). This
|
|
may improve scrub performance if the pool's data is very fragmented. If set to
|
|
\fB2\fR, the largest mostly-contiguous chunk of found data will be verified
|
|
first. By deferring scrubbing of small segments, we may later find adjacent data
|
|
to coalesce and increase the segment size. If set to \fB0\fR, zfs will use
|
|
strategy \fB1\fR during normal verification and strategy \fB2\fR while taking a
|
|
checkpoint.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_legacy\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
A value of 0 indicates that scrubs and resilvers will gather metadata in
|
|
memory before issuing sequential I/O. A value of 1 indicates that the legacy
|
|
algorithm will be used where I/O is initiated as soon as it is discovered.
|
|
Changing this value to 0 will not affect scrubs or resilvers that are already
|
|
in progress.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_max_ext_gap\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Indicates the largest gap in bytes between scrub / resilver I/Os that will still
|
|
be considered sequential for sorting purposes. Changing this value will not
|
|
affect scrubs or resilvers that are already in progress.
|
|
.sp
|
|
Default value: \fB2097152 (2 MB)\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_mem_lim_fact\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum fraction of RAM used for I/O sorting by sequential scan algorithm.
|
|
This tunable determines the hard limit for I/O sorting memory usage.
|
|
When the hard limit is reached we stop scanning metadata and start issuing
|
|
data verification I/O. This is done until we get below the soft limit.
|
|
.sp
|
|
Default value: \fB20\fR which is 5% of RAM (1/20).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_mem_lim_soft_fact\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The fraction of the hard limit used to determine the soft limit for I/O sorting
|
|
by the sequential scan algorithm. When we cross this limit from below no action
|
|
is taken. When we cross this limit from above it is because we are issuing
|
|
verification I/O. In this case (unless the metadata scan is done) we stop
|
|
issuing verification I/O and start scanning metadata again until we get to the
|
|
hard limit.
|
|
.sp
|
|
Default value: \fB20\fR which is 5% of the hard limit (1/20).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_strict_mem_lim\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Enforces tight memory limits on pool scans when a sequential scan is in
|
|
progress. When disabled the memory limit may be exceeded by fast disks.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_suspend_progress\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Freezes a scrub/resilver in progress without actually pausing it. Intended for
|
|
testing/debugging.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_scan_vdev_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum amount of data that can be issued concurrently for scrubs and
|
|
resilvers per leaf device, given in bytes.
|
|
.sp
|
|
Default value: \fB41943040\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_corrupt_data\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Allow sending of corrupt data (ignore read/checksum errors when sending data).
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_unmodified_spill_blocks\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Include unmodified spill blocks in the send stream. Under certain circumstances
|
|
previous versions of ZFS could incorrectly remove the spill block from an
|
|
existing object. Including unmodified copies of the spill blocks creates a
|
|
backwards compatible stream which will recreate a spill block if it was
|
|
incorrectly removed.
|
|
.sp
|
|
Use \fB1\fR for yes (default) and \fB0\fR for no.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_no_prefetch_queue_ff\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The fill fraction of the \fBzfs send\fR internal queues. The fill fraction
|
|
controls the timing with which internal threads are woken up.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_no_prefetch_queue_length\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum number of bytes allowed in \fBzfs send\fR's internal queues.
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_queue_ff\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The fill fraction of the \fBzfs send\fR prefetch queue. The fill fraction
|
|
controls the timing with which internal threads are woken up.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_send_queue_length\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum number of bytes allowed that will be prefetched by \fBzfs send\fR.
|
|
This value must be at least twice the maximum block size in use.
|
|
.sp
|
|
Default value: \fB16,777,216\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_recv_queue_ff\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The fill fraction of the \fBzfs receive\fR queue. The fill fraction
|
|
controls the timing with which internal threads are woken up.
|
|
.sp
|
|
Default value: \fB20\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_recv_queue_length\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum number of bytes allowed in the \fBzfs receive\fR queue. This value
|
|
must be at least twice the maximum block size in use.
|
|
.sp
|
|
Default value: \fB16,777,216\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_recv_write_batch_size\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum amount of data (in bytes) that \fBzfs receive\fR will write in
|
|
one DMU transaction. This is the uncompressed size, even when receiving a
|
|
compressed send stream. This setting will not reduce the write size below
|
|
a single block. Capped at a maximum of 32MB.
|
|
.sp
|
|
Default value: \fB1MB\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_override_estimate_recordsize\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Setting this variable overrides the default logic for estimating block
|
|
sizes when doing a zfs send. The default heuristic is that the average
|
|
block size will be the current recordsize. Override this value if most data
|
|
in your dataset is not of that size and you require accurate zfs send size
|
|
estimates.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_sync_pass_deferred_free\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Flushing of data to disk is done in passes. Defer frees starting in this pass.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_spa_discard_memory_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum memory used for prefetching a checkpoint's space map on each
|
|
vdev while discarding the checkpoint.
|
|
.sp
|
|
Default value: \fB16,777,216\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_special_class_metadata_reserve_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Only allow small data blocks to be allocated on the special and dedup vdev
|
|
types when the available free space percentage on these vdevs exceeds this
|
|
value. This ensures reserved space is available for pool metadata as the
|
|
special vdevs approach capacity.
|
|
.sp
|
|
Default value: \fB25\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_sync_pass_dont_compress\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Starting in this sync pass, we disable compression (including of metadata).
|
|
With the default setting, in practice, we don't have this many sync passes,
|
|
so this has no effect.
|
|
.sp
|
|
The original intent was that disabling compression would help the sync passes
|
|
to converge. However, in practice disabling compression increases the average
|
|
number of sync passes, because when we turn compression off, many blocks'
sizes will change and thus we have to re-allocate (not overwrite) them. It
|
|
also increases the number of 128KB allocations (e.g. for indirect blocks and
|
|
spacemaps) because these will not be compressed. The 128K allocations are
|
|
especially detrimental to performance on highly fragmented systems, which may
|
|
have very few free segments of this size, and may need to load new metaslabs
|
|
to satisfy 128K allocations.
|
|
.sp
|
|
Default value: \fB8\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_sync_pass_rewrite\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Rewrite new block pointers starting in this pass.
|
|
.sp
|
|
Default value: \fB2\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_sync_taskq_batch_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This controls the number of threads used by the dp_sync_taskq. The default
|
|
value of 75% will create a maximum of one thread per cpu.
|
|
.sp
|
|
Default value: \fB75\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_trim_extent_bytes_max\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Maximum size of TRIM command. Ranges larger than this will be split into
|
|
chunks no larger than \fBzfs_trim_extent_bytes_max\fR bytes before being
|
|
issued to the device.
|
|
.sp
|
|
Default value: \fB134,217,728\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_trim_extent_bytes_min\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Minimum size of TRIM commands. TRIM ranges smaller than this will be skipped
|
|
unless they're part of a larger range which was broken into chunks. This is
|
|
done because it's common for these small TRIMs to negatively impact overall
|
|
performance. This value can be set to 0 to TRIM all unallocated space.
|
|
.sp
|
|
Default value: \fB32,768\fR.
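.sp
The interaction between \fBzfs_trim_extent_bytes_max\fR and
\fBzfs_trim_extent_bytes_min\fR can be illustrated with a minimal sketch.
The function name and the byte-range representation are hypothetical and are
not taken from the ZFS source; only the split and skip rules described above
are modelled.
.nf
```python
def split_trim_range(start, length, extent_max, extent_min):
    """Return (offset, size) chunks for one free range, or [] if skipped."""
    # Assumption: a standalone range below the minimum is skipped entirely,
    # and a minimum of 0 disables the skip.
    if extent_min > 0 and length < extent_min:
        return []
    chunks = []
    offset = start
    end = start + length
    while offset < end:
        # No chunk may exceed the maximum extent size.
        size = min(extent_max, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


MIB = 1024 * 1024
# A 300 MiB free range with the documented defaults (134217728 and 32768).
chunks = split_trim_range(0, 300 * MIB, 134217728, 32768)
chunk_sizes_mib = [size // MIB for _, size in chunks]
```
.fi
Note that the trailing chunk of a split range may be smaller than the
minimum; it is still issued because it is part of a larger range.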
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_trim_metaslab_skip\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Skip uninitialized metaslabs during the TRIM process. This option is useful
|
|
for pools constructed from large thinly-provisioned devices where TRIM
|
|
operations are slow. As a pool ages, an increasing fraction of the pool's
metaslabs will be initialized, progressively degrading the usefulness of
|
|
this option. This setting is stored when starting a manual TRIM and will
|
|
persist for the duration of the requested TRIM.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_trim_queue_limit\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Maximum number of queued TRIMs outstanding per leaf vdev. The number of
|
|
concurrent TRIM commands issued to the device is controlled by the
|
|
\fBzfs_vdev_trim_min_active\fR and \fBzfs_vdev_trim_max_active\fR module
|
|
options.
|
|
.sp
|
|
Default value: \fB10\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_trim_txg_batch\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
The number of transaction groups worth of frees which should be aggregated
|
|
before TRIM operations are issued to the device. This setting represents a
|
|
trade-off between issuing larger, more efficient TRIM operations and the
|
|
delay before the recently trimmed space is available for use by the device.
|
|
.sp
|
|
Increasing this value will allow frees to be aggregated for a longer time.
|
|
This will result in larger TRIM operations and potentially increased memory
|
|
usage. Decreasing this value will have the opposite effect. The default
|
|
value of 32 was determined to be a reasonable compromise.
|
|
.sp
|
|
Default value: \fB32\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_txg_history\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Historical statistics for the last N txgs will be available in
|
|
\fB/proc/spl/kstat/zfs/<pool>/txgs\fR.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_txg_timeout\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Flush dirty data to disk at least every N seconds (maximum txg duration).
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_aggregate_trim\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Allow TRIM I/Os to be aggregated. This is normally not helpful because
|
|
the extents to be trimmed will already have been aggregated by the
|
|
metaslab. This option is provided for debugging and performance analysis.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_aggregation_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Max vdev I/O aggregation size
|
|
.sp
|
|
Default value: \fB1,048,576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_aggregation_limit_non_rotating\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Max vdev I/O aggregation size for non-rotating media
|
|
.sp
|
|
Default value: \fB131,072\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_cache_bshift\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Shift size to inflate reads to.
|
|
.sp
|
|
Default value: \fB16\fR (effectively 65536).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_cache_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Inflate reads smaller than this value to meet the \fBzfs_vdev_cache_bshift\fR
|
|
size (default 64k).
|
|
.sp
|
|
Default value: \fB16384\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_cache_size\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Total size of the per-disk cache in bytes.
|
|
.sp
|
|
Currently this feature is disabled as it has been found to not be helpful
|
|
for performance and in some cases harmful.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_mirror_rotating_inc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
A number by which the balancing algorithm increments the load calculation for
the purpose of selecting the least busy mirror member when an I/O immediately
follows its predecessor on rotational vdevs.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_mirror_rotating_seek_inc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
A number by which the balancing algorithm increments the load calculation for
|
|
the purpose of selecting the least busy mirror member when an I/O lacks
|
|
locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. I/Os within
this offset that do not immediately follow the previous I/O are incremented by
half.
|
|
.sp
|
|
Default value: \fB5\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_mirror_rotating_seek_offset\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum distance for the last queued I/O in which the balancing algorithm
|
|
considers an I/O to have locality.
|
|
See the section "ZFS I/O SCHEDULER".
|
|
.sp
|
|
Default value: \fB1048576\fR.
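.sp
The three rotating-media tunables can be read together as a small decision
rule. The following is a sketch based only on the descriptions in this page;
the function and argument names are hypothetical, and the use of integer
halving is an assumption.
.nf
```python
def rotating_load_increment(offset, last_offset, inc, seek_inc, seek_offset):
    """Load increment for one candidate mirror member on a rotational vdev."""
    if offset == last_offset:
        # The I/O immediately follows its predecessor.
        return inc
    if abs(offset - last_offset) < seek_offset:
        # Not sequential, but still within the locality window: half penalty.
        # Assumption: the halving uses integer division.
        return seek_inc // 2
    # The I/O lacks locality: full seek penalty.
    return seek_inc


# Documented defaults: rotating_inc=0, rotating_seek_inc=5,
# rotating_seek_offset=1048576.
sequential = rotating_load_increment(4096, 4096, 0, 5, 1048576)
nearby = rotating_load_increment(8192, 4096, 0, 5, 1048576)
distant = rotating_load_increment(10 * 1048576, 4096, 0, 5, 1048576)
```
.fi
The member with the smallest resulting load would be preferred, so
sequential and nearby I/Os tend to stay on the same disk.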
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_mirror_non_rotating_inc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
A number by which the balancing algorithm increments the load calculation for
|
|
the purpose of selecting the least busy mirror member on non-rotational vdevs
|
|
when I/Os do not immediately follow one another.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_mirror_non_rotating_seek_inc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
A number by which the balancing algorithm increments the load calculation for
|
|
the purpose of selecting the least busy mirror member when an I/O lacks
|
|
locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. I/Os within
this offset that do not immediately follow the previous I/O are incremented by
half.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_read_gap_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Aggregate read I/O operations if the gap on-disk between them is within this
|
|
threshold.
|
|
.sp
|
|
Default value: \fB32,768\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_write_gap_limit\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Aggregate write I/O operations if the gap on-disk between them is within this
threshold.
|
|
.sp
|
|
Default value: \fB4,096\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_raidz_impl\fR (string)
|
|
.ad
|
|
.RS 12n
|
|
Parameter for selecting raidz parity implementation to use.
|
|
|
|
Options marked (always) below may be selected on module load as they are
|
|
supported on all systems.
|
|
The remaining options may only be set after the module is loaded, as they
|
|
are available only if the implementations are compiled in and supported
|
|
on the running system.
|
|
|
|
Once the module is loaded, the content of
|
|
/sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options
|
|
with the currently selected one enclosed in [].
|
|
Possible options are:
  fastest - (always) implementation selected using built-in benchmark
  original - (always) original raidz implementation
  scalar - (always) scalar raidz implementation
  sse2 - implementation using SSE2 instruction set (64bit x86 only)
  ssse3 - implementation using SSSE3 instruction set (64bit x86 only)
  avx2 - implementation using AVX2 instruction set (64bit x86 only)
  avx512f - implementation using AVX512F instruction set (64bit x86 only)
  avx512bw - implementation using AVX512F & AVX512BW instruction sets (64bit x86 only)
  aarch64_neon - implementation using NEON (Aarch64/64 bit ARMv8 only)
  aarch64_neonx2 - implementation using NEON with more unrolling (Aarch64/64 bit ARMv8 only)
  powerpc_altivec - implementation using Altivec (PowerPC only)
|
|
.sp
|
|
Default value: \fBfastest\fR.
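.sp
As an illustration of the bracket convention described above, the following
sketch parses a parameter string and reports the selected implementation. The
sample string is made up for the example, and the helper name is
hypothetical; on a running system the text would be read from
/sys/module/zfs/parameters/zfs_vdev_raidz_impl.
.nf
```python
def parse_raidz_impl(text):
    """Return (selected, available) from the parameter's text content."""
    selected = None
    available = []
    for token in text.split():
        # The currently selected option is enclosed in [].
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return selected, available


# Made-up sample content; real output depends on the running system.
sample = "[fastest] original scalar sse2 ssse3"
selected, available = parse_raidz_impl(sample)
```
.fi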
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_vdev_scheduler\fR (charp)
|
|
.ad
|
|
.RS 12n
|
|
\fBDEPRECATED\fR: This option exists for compatibility with older user
|
|
configurations. It does nothing except print a warning to the kernel log if
|
|
set.
|
|
.sp
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zevent_cols\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When zevents are logged to the console use this as the word wrap width.
|
|
.sp
|
|
Default value: \fB80\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zevent_console\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Log events to the console
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zevent_len_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Max event queue length.
|
|
Events in the queue can be viewed with the \fBzpool events\fR command.
|
|
.sp
|
|
Default value: \fB512\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zevent_retain_max\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Maximum recent zevent records to retain for duplicate checking. Setting
|
|
this value to zero disables duplicate detection.
|
|
.sp
|
|
Default value: \fB2000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zevent_retain_expire_secs\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Lifespan for a recent ereport that was retained for duplicate checking.
|
|
.sp
|
|
Default value: \fB900\fR.
|
|
.RE
|
|
|
|
.sp
.ne 2
.na
|
|
\fBzfs_zil_clean_taskq_maxalloc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The maximum number of taskq entries that are allowed to be cached. When this
|
|
limit is exceeded transaction records (itxs) will be cleaned synchronously.
|
|
.sp
|
|
Default value: \fB1048576\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zil_clean_taskq_minalloc\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
The number of taskq entries that are pre-populated when the taskq is first
|
|
created and are immediately available for use.
|
|
.sp
|
|
Default value: \fB1024\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzfs_zil_clean_taskq_nthr_pct\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This controls the number of threads used by the dp_zil_clean_taskq. The default
|
|
value of 100% will create a maximum of one thread per CPU.
|
|
.sp
|
|
Default value: \fB100\fR%.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzil_maxblocksize\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
This sets the maximum block size used by the ZIL. On very fragmented pools,
|
|
lowering this (typically to 36KB) can improve performance.
|
|
.sp
|
|
Default value: \fB131072\fR (128KB).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzil_nocacheflush\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disable the cache flush commands that are normally sent to the disk(s) by
|
|
the ZIL after an LWB write has completed. Setting this will cause ZIL
|
|
corruption on power loss if a volatile out-of-order write cache is enabled.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzil_replay_disable\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Disable intent log replay. Replay can be disabled to recover from a corrupted
ZIL.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzil_slog_bulk\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Limit SLOG write size per commit executed with synchronous priority.
|
|
Any writes above that will be executed with lower (asynchronous) priority
|
|
to limit potential SLOG device abuse by a single active ZIL writer.
|
|
.sp
|
|
Default value: \fB786,432\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_deadman_log_all\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If non-zero, the zio deadman will produce debugging messages (see
|
|
\fBzfs_dbgmsg_enable\fR) for all zios, rather than only for leaf
|
|
zios possessing a vdev. This is meant to be used by developers to gain
|
|
diagnostic information for hang conditions which don't involve a mutex
|
|
or other locking primitive; typically conditions in which a thread in
|
|
the zio pipeline is looping indefinitely.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_decompress_fail_fraction\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
If non-zero, this value represents the denominator of the probability that zfs
|
|
should induce a decompression failure. For instance, for a 5% decompression
|
|
failure rate, this value should be set to 20.
|
|
.sp
|
|
Default value: \fB0\fR.
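.sp
Because the value is the denominator of the probability, the setting for a
wanted failure rate is its reciprocal. A minimal sketch of that arithmetic,
using a hypothetical helper name:
.nf
```python
def fail_fraction_for_rate(rate):
    """Denominator N such that 1/N approximates the wanted failure rate."""
    if rate <= 0:
        # A value of 0 leaves fault injection disabled.
        return 0
    return round(1 / rate)


five_percent = fail_fraction_for_rate(0.05)
one_percent = fail_fraction_for_rate(0.01)
```
.fi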
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_slow_io_ms\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
When an I/O operation takes more than \fBzio_slow_io_ms\fR milliseconds to
complete, it is marked as a slow I/O. Each slow I/O causes a delay zevent. Slow
|
|
I/O counters can be seen with "zpool status -s".
|
|
|
|
.sp
|
|
Default value: \fB30,000\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_dva_throttle_enabled\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Throttle block allocations in the I/O pipeline. This allows for
|
|
dynamic allocation distribution when devices are imbalanced.
|
|
When enabled, the maximum number of pending allocations per top-level vdev
|
|
is limited by \fBzfs_vdev_queue_depth_pct\fR.
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_requeue_io_start_cut_in_line\fR (int)
|
|
.ad
|
|
.RS 12n
|
|
Prioritize requeued I/O
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzio_taskq_batch_pct\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Percentage of online CPUs (or CPU cores, etc.) which will run a worker thread
|
|
for I/O. These workers are responsible for I/O work such as compression and
|
|
checksum calculations. Fractional number of CPUs will be rounded down.
|
|
.sp
|
|
The default value of 75 was chosen to avoid using all CPUs which can result in
|
|
latency issues and inconsistent application performance, especially when high
|
|
compression is enabled.
|
|
.sp
|
|
Default value: \fB75\fR.
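.sp
The rounding rule can be sketched as follows. The helper name is
hypothetical, and the floor of one thread is an assumption made so that the
example never returns zero; it is not stated in this page.
.nf
```python
def zio_worker_threads(online_cpus, batch_pct):
    """Worker threads for a given CPU count and percentage."""
    # Fractional CPUs are rounded down.
    threads = online_cpus * batch_pct // 100
    # Assumption: at least one worker thread is always created.
    return max(1, threads)


eight_cpus = zio_worker_threads(8, 75)
six_cpus = zio_worker_threads(6, 75)
one_cpu = zio_worker_threads(1, 75)
```
.fi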
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_inhibit_dev\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Do not create zvol device nodes. This may slightly improve startup time on
|
|
systems with a very large number of zvols.
|
|
.sp
|
|
Use \fB1\fR for yes and \fB0\fR for no (default).
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_major\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Major number for zvol block devices
|
|
.sp
|
|
Default value: \fB230\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_max_discard_blocks\fR (ulong)
|
|
.ad
|
|
.RS 12n
|
|
Discard (aka TRIM) operations done on zvols will be done in batches of this
|
|
many blocks, where block size is determined by the \fBvolblocksize\fR property
|
|
of a zvol.
|
|
.sp
|
|
Default value: \fB16,384\fR.
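.sp
Since the batch is counted in blocks, its size in bytes depends on
\fBvolblocksize\fR. A short sketch of the arithmetic, where the 8 KiB block
size is chosen purely for illustration and the helper name is hypothetical:
.nf
```python
def discard_batch_bytes(max_discard_blocks, volblocksize):
    """Bytes covered by one discard batch on a zvol."""
    return max_discard_blocks * volblocksize


# Documented default of 16384 blocks with an illustrative 8 KiB volblocksize.
batch = discard_batch_bytes(16384, 8192)
batch_mib = batch // (1024 * 1024)
```
.fi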
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_prefetch_bytes\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
When adding a zvol to the system prefetch \fBzvol_prefetch_bytes\fR
|
|
from the start and end of the volume. Prefetching these regions
|
|
of the volume is desirable because they are likely to be accessed
|
|
immediately by \fBblkid(8)\fR or by the kernel scanning for a partition
|
|
table.
|
|
.sp
|
|
Default value: \fB131,072\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_request_sync\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
When processing I/O requests for a zvol submit them synchronously. This
|
|
effectively limits the queue depth to 1 for each I/O submitter. When set
|
|
to 0 requests are handled asynchronously by a thread pool. The number of
|
|
requests which can be handled concurrently is controlled by \fBzvol_threads\fR.
|
|
.sp
|
|
Default value: \fB0\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_threads\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Max number of threads which can handle zvol I/O requests concurrently.
|
|
.sp
|
|
Default value: \fB32\fR.
|
|
.RE
|
|
|
|
.sp
|
|
.ne 2
|
|
.na
|
|
\fBzvol_volmode\fR (uint)
|
|
.ad
|
|
.RS 12n
|
|
Defines the behaviour of zvol block devices when \fBvolmode\fR is set to \fBdefault\fR.
|
|
Valid values are \fB1\fR (full), \fB2\fR (dev) and \fB3\fR (none).
|
|
.sp
|
|
Default value: \fB1\fR.
|
|
.RE
|
|
|
|
.SH ZFS I/O SCHEDULER
|
|
ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os.
|
|
The I/O scheduler determines when and in what order those operations are
|
|
issued. The I/O scheduler divides operations into five I/O classes
|
|
prioritized in the following order: sync read, sync write, async read,
|
|
async write, and scrub/resilver. Each queue defines the minimum and
|
|
maximum number of concurrent operations that may be issued to the
|
|
device. In addition, the device has an aggregate maximum,
|
|
\fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums
|
|
must not exceed the aggregate maximum. If the sum of the per-queue
|
|
maximums exceeds the aggregate maximum, then the number of active I/Os
|
|
may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will
|
|
be issued regardless of whether all per-queue minimums have been met.
|
|
.sp
|
|
For many physical devices, throughput increases with the number of
|
|
concurrent operations, but latency typically suffers. Further, physical
|
|
devices typically have a limit at which more concurrent operations have no
|
|
effect on throughput or can actually cause it to decrease.
|
|
.sp
|
|
The scheduler selects the next operation to issue by first looking for an
|
|
I/O class whose minimum has not been satisfied. Once all are satisfied and
|
|
the aggregate maximum has not been hit, the scheduler looks for classes
|
|
whose maximum has not been satisfied. Iteration through the I/O classes is
|
|
done in the order specified above. No further operations are issued if the
|
|
aggregate maximum number of concurrent operations has been hit or if there
|
|
are no operations queued for an I/O class that has not hit its maximum.
|
|
Every time an I/O is queued or an operation completes, the I/O scheduler
|
|
looks for new operations to issue.
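.sp
The selection rule described above can be sketched as two passes over the
I/O classes in priority order. This is an illustrative model, not the kernel
implementation: the class names, dictionaries, and limit values below are
hypothetical.
.nf
```python
CLASSES = ["sync_read", "sync_write", "async_read", "async_write", "scrub"]


def next_class(queued, active, min_active, max_active, aggregate_max):
    """Return the I/O class to issue from next, or None if nothing may issue."""
    # Nothing is issued once the aggregate maximum has been hit.
    if sum(active.values()) >= aggregate_max:
        return None
    # First pass: a class with queued work whose minimum is not yet satisfied.
    for name in CLASSES:
        if queued[name] > 0 and active[name] < min_active[name]:
            return name
    # Second pass: a class with queued work that has not hit its maximum.
    for name in CLASSES:
        if queued[name] > 0 and active[name] < max_active[name]:
            return name
    return None


# Illustrative limits only: every class has min 1 and max 2, aggregate max 4.
mins = {name: 1 for name in CLASSES}
maxs = {name: 2 for name in CLASSES}
queued = dict.fromkeys(CLASSES, 0)
queued["async_write"] = 5
queued["scrub"] = 3
active = dict.fromkeys(CLASSES, 0)
active["async_write"] = 1

first = next_class(queued, active, mins, maxs, 4)
active["scrub"] = 1
second = next_class(queued, active, mins, maxs, 4)
```
.fi
In this example the scrub class is chosen first because its minimum is
unmet, even though async writes have higher priority; once all minimums are
met, priority order decides.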
|
|
.sp
|
|
In general, smaller max_active's will lead to lower latency of synchronous
|
|
operations. Larger max_active's may lead to higher overall throughput,
|
|
depending on underlying storage.
|
|
.sp
|
|
The ratio of the queues' max_actives determines the balance of performance
|
|
between reads, writes, and scrubs. E.g., increasing
|
|
\fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete
|
|
more quickly, but reads and writes to have higher latency and lower throughput.
|
|
.sp
|
|
All I/O classes have a fixed maximum number of outstanding operations
|
|
except for the async write class. Asynchronous writes represent the data
|
|
that is committed to stable storage during the syncing stage for
|
|
transaction groups. Transaction groups enter the syncing state
|
|
periodically so the number of queued async writes will quickly burst up
|
|
and then bleed down to zero. Rather than servicing them as quickly as
|
|
possible, the I/O scheduler changes the maximum number of active async
|
|
write I/Os according to the amount of dirty data in the pool. Since
|
|
both throughput and latency typically increase with the number of
|
|
concurrent operations issued to physical devices, reducing the
|
|
burstiness in the number of concurrent operations also stabilizes the
|
|
response time of operations from other -- and in particular synchronous
|
|
-- queues. In broad strokes, the I/O scheduler will issue more
|
|
concurrent operations from the async write queue as there's more dirty
|
|
data in the pool.
|
|
.sp
|
|
Async Writes
|
|
.sp
|
|
The number of concurrent operations issued for the async write I/O class
|
|
follows a piece-wise linear function defined by a few adjustable points.
|
|
.nf

       |              o---------| <-- zfs_vdev_async_write_max_active
  ^    |             /^         |
  |    |            / |         |
active |           /  |         |
 I/O   |          /   |         |
count  |         /    |         |
       |        /     |         |
       |-------o      |         | <-- zfs_vdev_async_write_min_active
      0|_______^______|_________|
       0%      |      |       100% of zfs_dirty_data_max
               |      |
               |      `-- zfs_vdev_async_write_active_max_dirty_percent
               `--------- zfs_vdev_async_write_active_min_dirty_percent

.fi
|
|
Until the amount of dirty data exceeds a minimum percentage of the dirty
|
|
data allowed in the pool, the I/O scheduler will limit the number of
|
|
concurrent operations to the minimum. As that threshold is crossed, the
|
|
number of concurrent operations issued increases linearly to the maximum at
|
|
the specified maximum percentage of the dirty data allowed in the pool.
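.sp
The piece-wise linear function can be written out as a short sketch. The
function and argument names are hypothetical, the example values are
illustrative rather than documented defaults, and rounding down with integer
arithmetic is an assumption.
.nf
```python
def async_write_max_active(dirty, dirty_max, min_pct, max_pct,
                           min_active, max_active):
    """Concurrent async writes allowed for a given amount of dirty data."""
    lo = dirty_max * min_pct // 100  # dirty data where the slope starts
    hi = dirty_max * max_pct // 100  # dirty data where the slope ends
    if dirty <= lo:
        return min_active
    if dirty >= hi:
        return max_active
    # Linear interpolation between (lo, min_active) and (hi, max_active).
    # Assumption: the result is rounded down.
    return min_active + (dirty - lo) * (max_active - min_active) // (hi - lo)


# Illustrative values: dirty limit of 1000 units, slope between 30% and 60%,
# and 2 to 10 concurrent operations.
below = async_write_max_active(100, 1000, 30, 60, 2, 10)
middle = async_write_max_active(450, 1000, 30, 60, 2, 10)
above = async_write_max_active(900, 1000, 30, 60, 2, 10)
```
.fi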
|
|
.sp
|
|
Ideally, the amount of dirty data on a busy pool will stay in the sloped
|
|
part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR
|
|
and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the
|
|
maximum percentage, this indicates that the rate of incoming data is
|
|
greater than the rate that the backend storage can handle. In this case, we
|
|
must further throttle incoming writes, as described in the next section.
|
|
|
|
.SH ZFS TRANSACTION DELAY
|
|
We delay transactions when we've determined that the backend storage
|
|
isn't able to accommodate the rate of incoming writes.
|
|
.sp
|
|
If there is already a transaction waiting, we delay relative to when
|
|
that transaction will finish waiting. This way the calculated delay time
|
|
is independent of the number of threads concurrently executing
|
|
transactions.
|
|
.sp
|
|
If we are the only waiter, wait relative to when the transaction
|
|
started, rather than the current time. This credits the transaction for
|
|
"time already served", e.g. reading indirect blocks.
|
|
.sp
|
|
The minimum time for a transaction to take is calculated as:
|
|
.nf
|
|
min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
|
|
min_time is then capped at 100 milliseconds.
|
|
.fi
|
|
.sp
|
|
The delay has two degrees of freedom that can be adjusted via tunables. The
|
|
percentage of dirty data at which we start to delay is defined by
|
|
\fBzfs_delay_min_dirty_percent\fR. This should typically be at or above
|
|
\fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to
|
|
delay after writing at full speed has failed to keep up with the incoming write
|
|
rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking,
|
|
this variable determines the amount of delay at the midpoint of the curve.
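.sp
The delay formula above can be evaluated with a small sketch. The function
name is hypothetical, times are assumed to be in nanoseconds, and the example
values (a 60% starting threshold and a scale of 500000) are illustrative,
chosen so that the midpoint works out to 500us.
.nf
```python
CAP_NS = 100 * 1000 * 1000  # min_time is capped at 100 milliseconds


def tx_min_time_ns(dirty, dirty_max, delay_min_dirty_percent, delay_scale):
    """Minimum transaction time for a given amount of dirty data."""
    delay_min = dirty_max * delay_min_dirty_percent // 100
    if dirty <= delay_min:
        return 0  # below the threshold, no delay is applied
    if dirty >= dirty_max:
        return CAP_NS  # the formula diverges at the limit, so use the cap
    # min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
    min_time = delay_scale * (dirty - delay_min) // (dirty_max - dirty)
    return min(min_time, CAP_NS)


# Illustrative values: dirty limit of 1000 units, delay starting at 60%.
midpoint = tx_min_time_ns(800, 1000, 60, 500000)
near_limit = tx_min_time_ns(990, 1000, 60, 500000)
capped = tx_min_time_ns(999, 1000, 60, 500000)
# Treating the delay as the inverse of IOPS.
midpoint_iops = 1000000000 // midpoint
```
.fi
At the midpoint the two differences in the formula are equal, so the delay
equals the scale itself.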
|
|
.sp
|
|
.nf
delay
 10ms +-------------------------------------------------------------*+
      |                                                             *|
  9ms +                                                             *+
      |                                                             *|
  8ms +                                                             *+
      |                                                            * |
  7ms +                                                            * +
      |                                                            * |
  6ms +                                                            * +
      |                                                            * |
  5ms +                                                           *  +
      |                                                           *  |
  4ms +                                                           *  +
      |                                                           *  |
  3ms +                                                          *   +
      |                                                          *   |
  2ms +                                              (midpoint)  *   +
      |                                                  |    **     |
  1ms +                                                  v ***       +
      |             zfs_delay_scale ---------->     ********         |
    0 +-------------------------------------*********----------------+
      0%                    <- zfs_dirty_data_max ->               100%
.fi
|
|
.sp
|
|
Note that since the delay is added to the outstanding time remaining on the
|
|
most recent transaction, the delay is effectively the inverse of IOPS.
|
|
Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
|
|
was chosen such that small changes in the amount of accumulated dirty data
|
|
in the first 3/4 of the curve yield relatively small differences in the
|
|
amount of delay.
|
|
.sp
|
|
The effects can be easier to understand when the amount of delay is
|
|
represented on a log scale:
|
|
.sp
|
|
.nf
delay
100ms +-------------------------------------------------------------++
      +                                                              +
      |                                                              |
      +                                                             *+
 10ms +                                                             *+
      +                                                           ** +
      |                                              (midpoint)  **  |
      +                                                  |     **    +
  1ms +                                                  v ****      +
      +             zfs_delay_scale ---------->        *****         +
      |                                             ****             |
      +                                          ****                +
100us +                                        **                    +
      +                                       *                      +
      |                                      *                       |
      +                                     *                        +
 10us +                                     *                        +
      +                                                              +
      |                                                              |
      +                                                              +
      +--------------------------------------------------------------+
      0%                    <- zfs_dirty_data_max ->               100%
.fi
|
|
.sp
|
|
Note here that only as the amount of dirty data approaches its limit does
|
|
the delay start to increase rapidly. The goal of a properly tuned system
|
|
should be to keep the amount of dirty data out of that range by first
|
|
ensuring that the appropriate limits are set for the I/O scheduler to reach
|
|
optimal throughput on the backend storage, and then by changing the value
|
|
of \fBzfs_delay_scale\fR to increase the steepness of the curve.
|