Detect a slow raidz child during reads

A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios count` is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17227
This commit is contained in:
Paul Dagnelie
2025-08-27 16:41:48 -07:00
committed by Brian Behlendorf
parent 0620c979a5
commit d64711c202
28 changed files with 1399 additions and 13 deletions
+5 -4
View File
@@ -940,10 +940,11 @@ tags = ['functional', 'rename_dirs']
[tests/functional/replacement]
tests = ['attach_import', 'attach_multiple', 'attach_rebuild',
'attach_resilver', 'detach', 'rebuild_disabled_feature',
'rebuild_multiple', 'rebuild_raidz', 'replace_import', 'replace_rebuild',
'replace_resilver', 'resilver_restart_001', 'resilver_restart_002',
'scrub_cancel']
'attach_resilver', 'attach_resilver_sit_out', 'detach',
'rebuild_disabled_feature', 'rebuild_multiple', 'rebuild_raidz',
'replace_import', 'replace_rebuild', 'replace_resilver',
'replace_resilver_sit_out', 'resilver_restart_001',
'resilver_restart_002', 'scrub_cancel']
tags = ['functional', 'replacement']
[tests/functional/reservation]
+2 -1
View File
@@ -109,7 +109,8 @@ tags = ['functional', 'direct']
[tests/functional/events:Linux]
tests = ['events_001_pos', 'events_002_pos', 'zed_rc_filter', 'zed_fd_spill',
'zed_cksum_reported', 'zed_cksum_config', 'zed_io_config',
'zed_slow_io', 'zed_slow_io_many_vdevs', 'zed_diagnose_multiple']
'zed_slow_io', 'zed_slow_io_many_vdevs', 'zed_diagnose_multiple',
'slow_vdev_sit_out', 'slow_vdev_sit_out_neg', 'slow_vdev_degraded_sit_out']
tags = ['functional', 'events']
[tests/functional/fallocate:Linux]