Add slow disk diagnosis to ZED

Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, no action is taken by ZED,
like is done for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15469
This commit is contained in:
Don Brady
2024-02-08 10:19:52 -07:00
committed by GitHub
parent 229b9f4ed0
commit cbe882298e
29 changed files with 655 additions and 71 deletions
+2 -2
View File
@@ -260,8 +260,8 @@ sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
The number of checksum errors or slow I/Os exceeds acceptable levels and the
device is degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.