zed: Add deadman-slot_off.sh zedlet

Optionally turn off disk's enclosure slot if an I/O is hung
triggering the deadman.

It's possible for outstanding I/O to a misbehaving SCSI disk to
neither promptly complete or return an error.  This can occur due
to retry and recovery actions taken by the SCSI layer, driver, or
disk.  When it occurs the pool will be unresponsive even though
there may be sufficient redundancy configured to proceeded without
this single disk.

When a hung I/O is detected by the kmods it will be posted as a
deadman event.  By default an I/O is considered to be hung after
5 minutes.  This value can be changed with the zfs_deadman_ziotime_ms
module parameter.  If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set
the disk's enclosure slot will be powered off causing the outstanding
I/O to fail.  The ZED will then handle this like a normal disk failure.
By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set.

As part of this change `zfs_deadman_events_per_second` is added
to control the ratelimitting of deadman events independantly of
delay events.  In practice, a single deadman event is sufficient
and more aren't particularly useful.

Alphabetize the zfs_deadman_* entries in zfs.4.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16226
This commit is contained in:
Brian Behlendorf
2024-05-29 10:46:41 -07:00
committed by Tony Hutter
parent 8ca4319f60
commit b0cfb480ca
7 changed files with 106 additions and 14 deletions
+1
View File
@@ -29,6 +29,7 @@ CONDENSE_INDIRECT_OBSOLETE_PCT condense.indirect_obsolete_pct zfs_condense_indir
CONDENSE_MIN_MAPPING_BYTES condense.min_mapping_bytes zfs_condense_min_mapping_bytes
DBUF_CACHE_SHIFT dbuf.cache_shift dbuf_cache_shift
DEADMAN_CHECKTIME_MS deadman.checktime_ms zfs_deadman_checktime_ms
DEADMAN_EVENTS_PER_SECOND deadman_events_per_second zfs_deadman_events_per_second
DEADMAN_FAILMODE deadman.failmode zfs_deadman_failmode
DEADMAN_SYNCTIME_MS deadman.synctime_ms zfs_deadman_synctime_ms
DEADMAN_ZIOTIME_MS deadman.ziotime_ms zfs_deadman_ziotime_ms
@@ -28,7 +28,7 @@
# Verify spa deadman events are rate limited
#
# STRATEGY:
# 1. Reduce the zfs_slow_io_events_per_second to 1.
# 1. Reduce the zfs_deadman_events_per_second to 1.
# 2. Reduce the zfs_deadman_ziotime_ms to 1ms.
# 3. Write data to a pool and read it back.
# 4. Verify deadman events have been produced at a reasonable rate.
@@ -44,15 +44,15 @@ function cleanup
zinject -c all
default_cleanup_noexit
set_tunable64 SLOW_IO_EVENTS_PER_SECOND $OLD_SLOW_IO_EVENTS
set_tunable64 DEADMAN_EVENTS_PER_SECOND $OLD_DEADMAN_EVENTS
set_tunable64 DEADMAN_ZIOTIME_MS $ZIOTIME_DEFAULT
}
log_assert "Verify spa deadman events are rate limited"
log_onexit cleanup
OLD_SLOW_IO_EVENTS=$(get_tunable SLOW_IO_EVENTS_PER_SECOND)
log_must set_tunable64 SLOW_IO_EVENTS_PER_SECOND 1
OLD_DEADMAN_EVENTS=$(get_tunable DEADMAN_EVENTS_PER_SECOND)
log_must set_tunable64 DEADMAN_EVENTS_PER_SECOND 1
log_must set_tunable64 DEADMAN_ZIOTIME_MS 1
# Create a new pool in order to use the updated deadman settings.