zed: Add deadman-slot_off.sh zedlet

Optionally turn off disk's enclosure slot if an I/O is hung
triggering the deadman.

It's possible for outstanding I/O to a misbehaving SCSI disk to
neither promptly complete or return an error.  This can occur due
to retry and recovery actions taken by the SCSI layer, driver, or
disk.  When it occurs the pool will be unresponsive even though
there may be sufficient redundancy configured to proceeded without
this single disk.

When a hung I/O is detected by the kmods it will be posted as a
deadman event.  By default an I/O is considered to be hung after
5 minutes.  This value can be changed with the zfs_deadman_ziotime_ms
module parameter.  If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set
the disk's enclosure slot will be powered off causing the outstanding
I/O to fail.  The ZED will then handle this like a normal disk failure.
By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set.

As part of this change `zfs_deadman_events_per_second` is added
to control the ratelimitting of deadman events independantly of
delay events.  In practice, a single deadman event is sufficient
and more aren't particularly useful.

Alphabetize the zfs_deadman_* entries in zfs.4.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16226
This commit is contained in:
Brian Behlendorf
2024-05-29 10:46:41 -07:00
committed by GitHub
parent 800d59d577
commit 6b95031f56
7 changed files with 106 additions and 14 deletions
+9 -1
View File
@@ -112,6 +112,11 @@ int zfs_vdev_dtl_sm_blksz = (1 << 12);
*/
static unsigned int zfs_slow_io_events_per_second = 20;
/*
* Rate limit deadman "hung IO" events to this many per second.
*/
static unsigned int zfs_deadman_events_per_second = 1;
/*
* Rate limit checksum events after this many checksum errors per second.
*/
@@ -666,7 +671,7 @@ vdev_alloc_common(spa_t *spa, uint_t id, uint64_t guid, vdev_ops_t *ops)
*/
zfs_ratelimit_init(&vd->vdev_delay_rl, &zfs_slow_io_events_per_second,
1);
zfs_ratelimit_init(&vd->vdev_deadman_rl, &zfs_slow_io_events_per_second,
zfs_ratelimit_init(&vd->vdev_deadman_rl, &zfs_deadman_events_per_second,
1);
zfs_ratelimit_init(&vd->vdev_checksum_rl,
&zfs_checksum_events_per_second, 1);
@@ -6476,6 +6481,9 @@ ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, ms_count_limit, UINT, ZMOD_RW,
ZFS_MODULE_PARAM(zfs, zfs_, slow_io_events_per_second, UINT, ZMOD_RW,
"Rate limit slow IO (delay) events to this many per second");
ZFS_MODULE_PARAM(zfs, zfs_, deadman_events_per_second, UINT, ZMOD_RW,
"Rate limit hung IO (deadman) events to this many per second");
/* BEGIN CSTYLED */
ZFS_MODULE_PARAM(zfs, zfs_, checksum_events_per_second, UINT, ZMOD_RW,
"Rate limit checksum events to this many checksum errors per second "