2020-01-15 01:57:28 +03:00
|
|
|
# This file exports variables for each tunable used in the test suite.
|
|
|
|
#
|
|
|
|
# Different platforms use different names for most tunables. To avoid littering
|
|
|
|
# the tests with conditional logic for deciding how to set each tunable, the
|
|
|
|
# logic is instead consolidated to this one file.
|
|
|
|
#
|
|
|
|
# Any use of tunables in tests must use a name defined here. New entries
|
|
|
|
# should be added to the table as needed. Please keep the table sorted
|
|
|
|
# alphabetically for ease of maintenance.
|
|
|
|
#
|
|
|
|
# Platform-specific tunables should still use a NAME from this table for
|
|
|
|
# consistency. Enter UNSUPPORTED in the column for platforms on which the
|
|
|
|
# tunable is not implemented.
|
|
|
|
|
|
|
|
UNAME=$(uname)
|
|
|
|
|
|
|
|
# NAME FreeBSD tunable Linux tunable
|
|
|
|
cat <<%%%% |
|
|
|
|
ADMIN_SNAPSHOT UNSUPPORTED zfs_admin_snapshot
|
|
|
|
ALLOW_REDACTED_DATASET_MOUNT allow_redacted_dataset_mount zfs_allow_redacted_dataset_mount
|
|
|
|
ARC_MAX arc.max zfs_arc_max
|
|
|
|
ARC_MIN arc.min zfs_arc_min
|
|
|
|
ASYNC_BLOCK_MAX_BLOCKS async_block_max_blocks zfs_async_block_max_blocks
|
|
|
|
CHECKSUM_EVENTS_PER_SECOND checksum_events_per_second zfs_checksum_events_per_second
|
|
|
|
COMMIT_TIMEOUT_PCT commit_timeout_pct zfs_commit_timeout_pct
|
|
|
|
COMPRESSED_ARC_ENABLED compressed_arc_enabled zfs_compressed_arc_enabled
|
|
|
|
CONDENSE_INDIRECT_COMMIT_ENTRY_DELAY_MS condense.indirect_commit_entry_delay_ms zfs_condense_indirect_commit_entry_delay_ms
|
2021-04-12 07:49:13 +03:00
|
|
|
CONDENSE_INDIRECT_OBSOLETE_PCT condense.indirect_obsolete_pct zfs_condense_indirect_obsolete_pct
|
2020-01-15 01:57:28 +03:00
|
|
|
CONDENSE_MIN_MAPPING_BYTES condense.min_mapping_bytes zfs_condense_min_mapping_bytes
|
2021-07-27 00:47:08 +03:00
|
|
|
DBUF_CACHE_SHIFT dbuf.cache_shift dbuf_cache_shift
|
2024-08-07 02:36:55 +03:00
|
|
|
DDT_ZAP_DEFAULT_BS dedup.ddt_zap_default_bs ddt_zap_default_bs
|
|
|
|
DDT_ZAP_DEFAULT_IBS dedup.ddt_zap_default_ibs ddt_zap_default_ibs
|
|
|
|
DDT_DATA_IS_SPECIAL ddt_data_is_special zfs_ddt_data_is_special
|
ddt: dedup log
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.
Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.
A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.
The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.
All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2023-06-22 10:46:22 +03:00
|
|
|
DEDUP_LOG_TXG_MAX dedup.log_txg_max zfs_dedup_log_txg_max
|
2021-03-12 06:23:24 +03:00
|
|
|
DEADMAN_CHECKTIME_MS deadman.checktime_ms zfs_deadman_checktime_ms
|
2024-05-29 20:46:41 +03:00
|
|
|
DEADMAN_EVENTS_PER_SECOND deadman_events_per_second zfs_deadman_events_per_second
|
2021-03-12 06:23:24 +03:00
|
|
|
DEADMAN_FAILMODE deadman.failmode zfs_deadman_failmode
|
|
|
|
DEADMAN_SYNCTIME_MS deadman.synctime_ms zfs_deadman_synctime_ms
|
|
|
|
DEADMAN_ZIOTIME_MS deadman.ziotime_ms zfs_deadman_ziotime_ms
|
2020-01-29 22:25:15 +03:00
|
|
|
DISABLE_IVSET_GUID_CHECK disable_ivset_guid_check zfs_disable_ivset_guid_check
|
2021-11-08 00:27:44 +03:00
|
|
|
DMU_OFFSET_NEXT_SYNC dmu_offset_next_sync zfs_dmu_offset_next_sync
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
EMBEDDED_SLOG_MIN_MS embedded_slog_min_ms zfs_embedded_slog_min_ms
|
2020-01-15 01:57:28 +03:00
|
|
|
INITIALIZE_CHUNK_SIZE initialize_chunk_size zfs_initialize_chunk_size
|
|
|
|
INITIALIZE_VALUE initialize_value zfs_initialize_value
|
|
|
|
KEEP_LOG_SPACEMAPS_AT_EXPORT keep_log_spacemaps_at_export zfs_keep_log_spacemaps_at_export
|
2020-07-28 02:11:47 +03:00
|
|
|
LUA_MAX_MEMLIMIT lua.max_memlimit zfs_lua_max_memlimit
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
L2ARC_MFUONLY l2arc.mfuonly l2arc_mfuonly
|
2020-03-31 20:46:48 +03:00
|
|
|
L2ARC_NOPREFETCH l2arc.noprefetch l2arc_noprefetch
|
2020-04-14 21:36:28 +03:00
|
|
|
L2ARC_REBUILD_BLOCKS_MIN_L2SIZE l2arc.rebuild_blocks_min_l2size l2arc_rebuild_blocks_min_l2size
|
|
|
|
L2ARC_REBUILD_ENABLED l2arc.rebuild_enabled l2arc_rebuild_enabled
|
2020-08-01 07:18:32 +03:00
|
|
|
L2ARC_TRIM_AHEAD l2arc.trim_ahead l2arc_trim_ahead
|
2020-03-31 20:46:48 +03:00
|
|
|
L2ARC_WRITE_BOOST l2arc.write_boost l2arc_write_boost
|
|
|
|
L2ARC_WRITE_MAX l2arc.write_max l2arc_write_max
|
2020-01-15 01:57:28 +03:00
|
|
|
LIVELIST_CONDENSE_NEW_ALLOC livelist.condense.new_alloc zfs_livelist_condense_new_alloc
|
|
|
|
LIVELIST_CONDENSE_SYNC_CANCEL livelist.condense.sync_cancel zfs_livelist_condense_sync_cancel
|
|
|
|
LIVELIST_CONDENSE_SYNC_PAUSE livelist.condense.sync_pause zfs_livelist_condense_sync_pause
|
|
|
|
LIVELIST_CONDENSE_ZTHR_CANCEL livelist.condense.zthr_cancel zfs_livelist_condense_zthr_cancel
|
|
|
|
LIVELIST_CONDENSE_ZTHR_PAUSE livelist.condense.zthr_pause zfs_livelist_condense_zthr_pause
|
|
|
|
LIVELIST_MAX_ENTRIES livelist.max_entries zfs_livelist_max_entries
|
|
|
|
LIVELIST_MIN_PERCENT_SHARED livelist.min_percent_shared zfs_livelist_min_percent_shared
|
|
|
|
MAX_DATASET_NESTING max_dataset_nesting zfs_max_dataset_nesting
|
|
|
|
MAX_MISSING_TVDS max_missing_tvds zfs_max_missing_tvds
|
|
|
|
METASLAB_DEBUG_LOAD metaslab.debug_load metaslab_debug_load
|
|
|
|
METASLAB_FORCE_GANGING metaslab.force_ganging metaslab_force_ganging
|
|
|
|
MULTIHOST_FAIL_INTERVALS multihost.fail_intervals zfs_multihost_fail_intervals
|
2020-09-02 02:19:19 +03:00
|
|
|
MULTIHOST_HISTORY multihost.history zfs_multihost_history
|
2020-01-15 01:57:28 +03:00
|
|
|
MULTIHOST_IMPORT_INTERVALS multihost.import_intervals zfs_multihost_import_intervals
|
2020-06-23 23:32:42 +03:00
|
|
|
MULTIHOST_INTERVAL multihost.interval zfs_multihost_interval
|
2020-01-15 01:57:28 +03:00
|
|
|
OVERRIDE_ESTIMATE_RECORDSIZE send.override_estimate_recordsize zfs_override_estimate_recordsize
|
2020-10-06 01:29:05 +03:00
|
|
|
PREFETCH_DISABLE prefetch.disable zfs_prefetch_disable
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
RAIDZ_EXPAND_MAX_REFLOW_BYTES vdev.expand_max_reflow_bytes raidz_expand_max_reflow_bytes
|
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID. This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.
A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`. No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.
zpool create <pool> draid[1,2,3] <vdevs...>
Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons. The supported options include:
zpool create <pool> \
draid[<parity>][:<data>d][:<children>c][:<spares>s] \
<vdevs...>
- draid[parity] - Parity level (default 1)
- draid[:<data>d] - Data devices per group (default 8)
- draid[:<children>c] - Expected number of child vdevs
- draid[:<spares>s] - Distributed hot spares (default 0)
Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
slag7 ONLINE 0 0 0
draid2:8d:68c:2s-0 ONLINE 0 0 0
L0 ONLINE 0 0 0
L1 ONLINE 0 0 0
...
U25 ONLINE 0 0 0
U26 ONLINE 0 0 0
spare-53 ONLINE 0 0 0
U27 ONLINE 0 0 0
draid2-0-0 ONLINE 0 0 0
U28 ONLINE 0 0 0
U29 ONLINE 0 0 0
...
U42 ONLINE 0 0 0
U43 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
L5 ONLINE 0 0 0
U5 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
L6 ONLINE 0 0 0
U6 ONLINE 0 0 0
spares
draid2-0-0 INUSE currently in use
draid2-0-1 AVAIL
```
When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command. These options are leverages
by zloop.sh to test a wide range of dRAID configurations.
-K draid|raidz|random - kind of RAID to test
-D <value> - dRAID data drives per group
-S <value> - dRAID distributed hot spares
-R <value> - RAID parity (raidz or dRAID)
The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-14 00:51:51 +03:00
|
|
|
REBUILD_SCRUB_ENABLED rebuild_scrub_enabled zfs_rebuild_scrub_enabled
|
2020-01-15 01:57:28 +03:00
|
|
|
REMOVAL_SUSPEND_PROGRESS removal_suspend_progress zfs_removal_suspend_progress
|
|
|
|
REMOVE_MAX_SEGMENT remove_max_segment zfs_remove_max_segment
|
|
|
|
RESILVER_MIN_TIME_MS resilver_min_time_ms zfs_resilver_min_time_ms
|
Resilver restarts unnecessarily when it encounters errors
When a resilver finishes, vdev_dtl_reassess is called to hopefully
excise DTL_MISSING (amongst other things). If there are errors during
the resilver, they are tracked in DTL_SCRUB, as spelled out in the
block comment in vdev.c. DTL_SCRUB is in-core only, so it can only
be used if the pool was online for the whole resilver. This state is
tracked with the spa_scrub_started flag, which only gets set when
the scan is initialized. Unfortunately, this flag gets cleared right
before vdev_dtl_reassess gets called, so if there are any errors
during the scan, DTL_MISSING will never get excised and the resilver
will just continually restart. This fix simply moves clearing that
flag until after the call to vdev_dtl_reasses.
In addition, if a pool is imported and already has scn_errors > 0,
this change will restart the resilver immediately instead of doing
the rest of the scan and then restarting it from the beginning. On
the other hand, if scn_errors == 0 at import, then no errors have
been encountered so far, so the spa_scrub_started flag can be safely
set.
A test has been added to verify that resilver does not restart when
relevant DTL's are available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #10291
2020-05-13 20:54:27 +03:00
|
|
|
SCAN_LEGACY scan_legacy zfs_scan_legacy
|
2020-01-15 01:57:28 +03:00
|
|
|
SCAN_SUSPEND_PROGRESS scan_suspend_progress zfs_scan_suspend_progress
|
|
|
|
SCAN_VDEV_LIMIT scan_vdev_limit zfs_scan_vdev_limit
|
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 21:19:41 +03:00
|
|
|
SCRUB_AFTER_EXPAND scrub_after_expand zfs_scrub_after_expand
|
2020-01-15 01:57:28 +03:00
|
|
|
SEND_HOLES_WITHOUT_BIRTH_TIME send_holes_without_birth_time send_holes_without_birth_time
|
|
|
|
SLOW_IO_EVENTS_PER_SECOND slow_io_events_per_second zfs_slow_io_events_per_second
|
|
|
|
SPA_ASIZE_INFLATION spa.asize_inflation spa_asize_inflation
|
|
|
|
SPA_DISCARD_MEMORY_LIMIT spa.discard_memory_limit zfs_spa_discard_memory_limit
|
|
|
|
SPA_LOAD_VERIFY_DATA spa.load_verify_data spa_load_verify_data
|
|
|
|
SPA_LOAD_VERIFY_METADATA spa.load_verify_metadata spa_load_verify_metadata
|
|
|
|
TRIM_EXTENT_BYTES_MIN trim.extent_bytes_min zfs_trim_extent_bytes_min
|
|
|
|
TRIM_METASLAB_SKIP trim.metaslab_skip zfs_trim_metaslab_skip
|
|
|
|
TRIM_TXG_BATCH trim.txg_batch zfs_trim_txg_batch
|
2020-09-02 02:19:19 +03:00
|
|
|
TXG_HISTORY txg.history zfs_txg_history
|
2020-04-15 21:14:47 +03:00
|
|
|
TXG_TIMEOUT txg.timeout zfs_txg_timeout
|
2020-01-15 01:57:28 +03:00
|
|
|
UNLINK_SUSPEND_PROGRESS UNSUPPORTED zfs_unlink_suspend_progress
|
Improve too large physical ashift handling
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.
When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency. Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.
Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.
Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".
There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified. Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #13798
2022-09-08 20:30:53 +03:00
|
|
|
VDEV_FILE_LOGICAL_ASHIFT vdev.file.logical_ashift vdev_file_logical_ashift
|
2020-09-18 22:13:47 +03:00
|
|
|
VDEV_FILE_PHYSICAL_ASHIFT vdev.file.physical_ashift vdev_file_physical_ashift
|
Improve too large physical ashift handling
When iterating through children physical ashifts for vdev, prefer
ones above the maximum logical ashift, that we can actually use,
but within the administrator defined maximum.
When selecting top-level vdev ashift, do not set it to the defined
maximum in case physical ashift is even higher, but just ignore one.
Using the maximum does not prevent misaligned writes, but reduces
space efficiency. Since ZFS tries to write data sequentially and
aggregates the writes, in many cases large misanigned writes may be
not as bad as the space penalty otherwise.
Allow internal physical ashifts for vdevs higher than SHIFT_MAX.
May be one day allocator or aggregation could benefit from that.
Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB),
so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB),
but only if it really has to or explicitly told to, but not as an
"optimization".
There are some read-intensive NVMe SSDs that report Preferred Write
Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a
space inefficiency that can't be justified. Instead these changes
make ZFS fall back to logical ashift of 12 (4KB) by default and
only warn user that it may be suboptimal for performance.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #13798
2022-09-08 20:30:53 +03:00
|
|
|
VDEV_MAX_AUTO_ASHIFT vdev.max_auto_ashift zfs_vdev_max_auto_ashift
|
2020-01-15 01:57:28 +03:00
|
|
|
VDEV_MIN_MS_COUNT vdev.min_ms_count zfs_vdev_min_ms_count
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
VDEV_DIRECT_WR_VERIFY vdev.direct_write_verify zfs_vdev_direct_write_verify
|
2020-01-15 01:57:28 +03:00
|
|
|
VDEV_VALIDATE_SKIP vdev.validate_skip vdev_validate_skip
|
|
|
|
VOL_INHIBIT_DEV UNSUPPORTED zvol_inhibit_dev
|
|
|
|
VOL_MODE vol.mode zvol_volmode
|
|
|
|
VOL_RECURSIVE vol.recursive UNSUPPORTED
|
2023-10-24 00:39:59 +03:00
|
|
|
VOL_USE_BLK_MQ UNSUPPORTED zvol_use_blk_mq
|
2024-08-07 02:36:55 +03:00
|
|
|
BCLONE_ENABLED bclone_enabled zfs_bclone_enabled
|
|
|
|
BCLONE_WAIT_DIRTY bclone_wait_dirty zfs_bclone_wait_dirty
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
DIO_ENABLED dio_enabled zfs_dio_enabled
|
2022-02-16 03:35:30 +03:00
|
|
|
XATTR_COMPAT xattr_compat zfs_xattr_compat
|
2020-02-18 22:22:56 +03:00
|
|
|
ZEVENT_LEN_MAX zevent.len_max zfs_zevent_len_max
|
2020-09-04 20:34:28 +03:00
|
|
|
ZEVENT_RETAIN_MAX zevent.retain_max zfs_zevent_retain_max
|
2020-01-15 01:57:28 +03:00
|
|
|
ZIO_SLOW_IO_MS zio.slow_io_ms zio_slow_io_ms
|
log xattr=sa create/remove/update to ZIL
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.
For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.
This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
zil_process_commit_list()
if lwb == NULL // first zil_commit since zil_open
zil_create()
if no log block pointer in ZIL header:
if feature enabled and not active:
// CASE 1
enable, COALESCE txg wait with dmu_tx that allocated the
log block
else // log block was allocated earlier than this zil_open
if feature enabled and not active:
// CASE 2
enable, EXPLICIT txg wait
else // already have an in-DRAM LWB
if feature enabled and not active:
// this happens when we enable the feature after zil_create
// CASE 3
enable, EXPLICIT txg wait
[1] https://github.com/illumos/illumos-gate/commit/da6c28aaf62fa55f0fdb8004aa40f88f23bf53f0
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #8768
Closes #9078
2022-02-23 00:06:43 +03:00
|
|
|
ZIL_SAXATTR zil_saxattr zfs_zil_saxattr
|
2020-01-15 01:57:28 +03:00
|
|
|
%%%%
|
|
|
|
while read name FreeBSD Linux; do
|
|
|
|
eval "export ${name}=\$${UNAME}"
|
|
|
|
done
|