Serapheim Dimitropoulos 37d5a3e04b Stop ganging due to past vdev write errors
= Problem

While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. On further analysis
we discovered that the pool was creating gang blocks because all of
its disks had non-zero write error counts, so they were skipped for
normal metaslab allocations by the following if-clause in
`metaslab_alloc_dva()`:
```
	/*
	 * Avoid writing single-copy data to a failing,
	 * non-redundant vdev, unless we've already tried all
	 * other vdevs.
	 */
	if ((vd->vdev_stat.vs_write_errors > 0 ||
	    vd->vdev_state < VDEV_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->vdev_children == 0) {
		metaslab_trace_add(zal, mg, NULL, psize, d,
		    TRACE_VDEV_ERROR, allocator);
		goto next;
	}
```

= Proposed Solution

Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still prefer HEALTHY vdevs
because the vdev_state check stays in place, so the write error
check doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).
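
As a rough illustration of the change, the skip condition before and
after can be modeled as below. This is only a sketch: the
`sketch_vdev_t` struct, the `sketch_state_t` enum and the two
`skip_vdev_*` helpers are hypothetical stand-ins, not the real ZFS
types or the actual diff. Only the shape of the predicate is taken
from the if-clause quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real ZFS types. */
typedef enum {
	SKETCH_STATE_FAULTED,
	SKETCH_STATE_DEGRADED,
	SKETCH_STATE_HEALTHY
} sketch_state_t;

typedef struct {
	uint64_t write_errors;	/* models vdev_stat.vs_write_errors */
	sketch_state_t state;	/* models vdev_state */
	uint64_t children;	/* models vdev_children */
} sketch_vdev_t;

/*
 * Old condition: a leaf vdev is skipped for the first DVA if it is
 * not healthy OR if it has ever recorded a write error.
 */
bool
skip_vdev_old(const sketch_vdev_t *vd, int d, bool try_hard)
{
	return ((vd->write_errors > 0 ||
	    vd->state < SKETCH_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->children == 0);
}

/*
 * Proposed condition: only the current state matters, so past write
 * errors alone no longer cause the vdev to be skipped.
 */
bool
skip_vdev_new(const sketch_vdev_t *vd, int d, bool try_hard)
{
	return (vd->state < SKETCH_STATE_HEALTHY &&
	    d == 0 && !try_hard && vd->children == 0);
}
```

With a healthy leaf vdev that has a single past write error,
`skip_vdev_old()` returns true while `skip_vdev_new()` returns false,
which is the case that was pushing the customer's pool into ganging.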

= Testing

I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME        SIZE  ALLOC   FREE
volpool    22.5G   117M  22.4G
  xvdb     7.99G  40.2M  7.46G
  xvdc     7.99G  39.1M  7.46G
  xvdd     7.99G  37.8M  7.46G
```

And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```

And got the vdevs to the following state:
```
$ zpool status volpool
  pool: volpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:

	NAME        STATE     READ WRITE CKSUM
	volpool     ONLINE       0     0     0
	  xvdb      ONLINE       0     1     0
	  xvdc      ONLINE       0     1     0
	  xvdd      ONLINE       0     4     0

```

I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0  # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```

Then I checked that the problem was reproduced in my VM: the gang
count reported by zdb kept growing as I was writing more data:
```
$ sudo zdb volpool | grep gang
        ganged count:              1384

$ sudo zdb volpool | grep gang
        ganged count:              1393

$ sudo zdb volpool | grep gang
        ganged count:              1402

$ sudo zdb volpool | grep gang
        ganged count:              1414
```

Then I updated my bits with this patch and the gang count stayed the
same.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003
2022-11-01 12:36:25 -07:00