mirror_zfs/module/zfs
Serapheim Dimitropoulos 4dcc2bde9c
Stop ganging due to past vdev write errors
= Problem

While examining a customer's system we noticed unreasonable space
usage from a few snapshots due to gang blocks. Under some further
analysis we discovered that the pool would create gang blocks because
all its disks had non-zero write error counts and they'd be skipped
for normal metaslab allocations due to the following if-clause in
`metaslab_alloc_dva()`:
```
	/*
	 * Avoid writing single-copy data to a failing,
	 * non-redundant vdev, unless we've already tried all
	 * other vdevs.
	 */
	if ((vd->vdev_stat.vs_write_errors > 0 ||
	    vd->vdev_state < VDEV_STATE_HEALTHY) &&
	    d == 0 && !try_hard && vd->vdev_children == 0) {
		metaslab_trace_add(zal, mg, NULL, psize, d,
		    TRACE_VDEV_ERROR, allocator);
		goto next;
	}
```

= Proposed Solution

Get rid of the predicate in the if-clause that checks the past
write errors of the selected vdev. We still try to allocate from
HEALTHY vdevs anyway by checking vdev_state so the past write
errors doesn't seem to help us (quite the opposite - it can cause
issues in long-lived pools like the one from our customer).

= Testing

I first created a pool with 3 vdevs:
```
$ zpool list -v volpool
NAME        SIZE  ALLOC   FREE
volpool    22.5G   117M  22.4G
  xvdb     7.99G  40.2M  7.46G
  xvdc     7.99G  39.1M  7.46G
  xvdd     7.99G  37.8M  7.46G
```

And used `zinject` like so with each one of them:
```
$ sudo zinject -d xvdb -e io -T write -f 0.1 volpool
```

And got the vdevs to the following state:
```
$ zpool status volpool
  pool: volpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
...<cropped>..
action: Determine if the device needs to be replaced, and clear the
...<cropped>..
config:

	NAME        STATE     READ WRITE CKSUM
	volpool     ONLINE       0     0     0
	  xvdb      ONLINE       0     1     0
	  xvdc      ONLINE       0     1     0
	  xvdd      ONLINE       0     4     0

```

I also double-checked their write error counters with sdb:
```
sdb> spa volpool | vdev | member vdev_stat.vs_write_errors
(uint64_t)0  # <---- this is the root vdev
(uint64_t)2
(uint64_t)1
(uint64_t)1
```

Then I checked that I the problem was reproduced in my VM as I the
gang count was growing in zdb as I was writting more data:
```
$ sudo zdb volpool | grep gang
        ganged count:              1384

$ sudo zdb volpool | grep gang
        ganged count:              1393

$ sudo zdb volpool | grep gang
        ganged count:              1402

$ sudo zdb volpool | grep gang
        ganged count:              1414
```

Then I updated my bits with this patch and the gang count stayed the
same.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14003
2022-10-11 12:27:41 -07:00
..
abd.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
aggsum.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
arc.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
blake3_zfs.c Fix memory allocation issue for BLAKE3 context 2022-06-21 14:32:09 -07:00
blkptr.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
bplist.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
bpobj.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
bptree.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
bqueue.c zfs recv hangs if max recordsize is less than received recordsize 2022-09-16 13:52:25 -07:00
btree.c Add zfs_btree_verify_intensity kernel module parameter 2022-09-15 16:22:33 -07:00
dataset_kstats.c Add support for per dataset zil stats and use wmsum counters 2022-07-20 17:14:06 -07:00
dbuf_stats.c Revert "Reduce dbuf_find() lock contention" 2022-09-22 12:59:41 -07:00
dbuf.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
ddt_zap.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
ddt.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dmu_diff.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dmu_object.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
dmu_objset.c Revert "Avoid panic with recordsize > 128k, raw sending and no large_blocks" 2022-08-25 13:33:32 -07:00
dmu_recv.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
dmu_redact.c Fix incorrect size given to bqueue_enqueue() call in dmu_redact.c 2022-09-15 16:21:21 -07:00
dmu_send.c Fix unsafe string operations 2022-09-27 16:47:24 -07:00
dmu_traverse.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
dmu_tx.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dmu_zfetch.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dmu.c Fix double const qualifier declarations 2022-09-30 15:34:39 -07:00
dnode_sync.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dnode.c Cleanup: Use OpenSolaris functions to call scheduler 2022-09-12 09:55:37 -07:00
dsl_bookmark.c Remaining {=> const} char|void *tag 2022-06-29 14:08:59 -07:00
dsl_crypt.c Fix zpool status in case of unloaded keys 2022-08-22 17:42:01 -07:00
dsl_dataset.c Fix potential NULL pointer dereference in dsl_dataset_promote_check() 2022-09-30 16:59:51 -07:00
dsl_deadlist.c Fix panic in dsl_process_sub_livelist for EINTR 2022-09-29 09:39:48 -07:00
dsl_deleg.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dsl_destroy.c Prevent zevent list from consuming all of kernel memory 2022-08-22 12:36:22 -07:00
dsl_dir.c Cleanup: Switch to strlcpy from strncpy 2022-09-27 16:35:29 -07:00
dsl_pool.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
dsl_prop.c Cleanup: Switch to strlcpy from strncpy 2022-09-27 16:35:29 -07:00
dsl_scan.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
dsl_synctask.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
dsl_userhold.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
edonr_zfs.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
fm.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
gzip.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
hkdf.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
lz4_zfs.c Updated the lz4 decompressor 2022-01-07 10:36:49 -08:00
lz4.c lz4: Cherrypick fix for CVE-2021-3520 2022-01-12 16:14:36 -08:00
lzjb.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
metaslab.c Stop ganging due to past vdev write errors 2022-10-11 12:27:41 -07:00
mmp.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
multilist.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
objlist.c
pathname.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
range_tree.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
refcount.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
rrwlock.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
sa.c Fix double const qualifier declarations 2022-09-30 15:34:39 -07:00
sha256.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
skein_zfs.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
spa_checkpoint.c Fix usage of zed_log_msg() and zfs_panic_recover() 2022-09-19 17:32:18 -07:00
spa_config.c zed: mark disks as REMOVED when they are removed 2022-09-28 09:48:46 -07:00
spa_errlog.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
spa_history.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
spa_log_spacemap.c Cleanup: Use OpenSolaris functions to call scheduler 2022-09-12 09:55:37 -07:00
spa_misc.c zed: mark disks as REMOVED when they are removed 2022-09-28 09:48:46 -07:00
spa_stats.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
spa.c zed: mark disks as REMOVED when they are removed 2022-09-28 09:48:46 -07:00
space_map.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
space_reftree.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
THIRDPARTYLICENSE.cityhash
THIRDPARTYLICENSE.cityhash.descrip
txg.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
uberblock.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
unique.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_cache.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
vdev_draid_rand.c
vdev_draid.c vdev_draid_lookup_map() should not iterate outside draid_maps 2022-09-12 12:51:17 -07:00
vdev_indirect_births.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
vdev_indirect_mapping.c Remove bcopy(), bzero(), bcmp() 2022-03-15 15:13:42 -07:00
vdev_indirect.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
vdev_initialize.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_label.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_mirror.c Improve too large physical ashift handling 2022-09-08 10:30:53 -07:00
vdev_missing.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_queue.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
vdev_raidz_math_aarch64_neon_common.h Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_aarch64_neon.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_aarch64_neonx2.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_avx2.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_avx512bw.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_avx512f.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_impl.h Cleanup Raid-Z Typo fixes 2022-09-06 09:43:21 -07:00
vdev_raidz_math_powerpc_altivec_common.h Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_powerpc_altivec.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_scalar.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_sse2.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math_ssse3.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz_math.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_raidz.c Improve too large physical ashift handling 2022-09-08 10:30:53 -07:00
vdev_rebuild.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_removal.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
vdev_root.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev_trim.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
vdev.c Fix uninitialized value read in vdev_prop_set() 2022-10-11 12:24:36 -07:00
zap_leaf.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zap_micro.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zap.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zcp_get.c Cleanup: Switch to strlcpy from strncpy 2022-09-27 16:35:29 -07:00
zcp_global.c
zcp_iter.c module/*.ko: prune .data, global .rodata 2022-01-14 15:37:55 -08:00
zcp_set.c
zcp_synctask.c Add zfs.sync.snapshot_rename 2022-09-02 13:31:19 -07:00
zcp.c Remaining {=> const} char|void *tag 2022-06-29 14:08:59 -07:00
zfeature.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_byteswap.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_chksum.c Fix BLAKE3 tuneable and module loading on Linux and FreeBSD 2022-09-16 14:25:53 -07:00
zfs_fm.c Fix unchecked return values 2022-09-29 09:02:57 -07:00
zfs_fuid.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_ioctl.c zed: mark disks as REMOVED when they are removed 2022-09-28 09:48:46 -07:00
zfs_log.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_onexit.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_quota.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_ratelimit.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_replay.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_rlock.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_sa.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zfs_vnops.c zfs_enter rework 2022-09-16 13:36:47 -07:00
zil.c Cleanup: Specify unsignedness on things that should not be signed 2022-09-27 16:42:41 -07:00
zio_checksum.c Fix double const qualifier declarations 2022-09-30 15:34:39 -07:00
zio_compress.c Fix double const qualifier declarations 2022-09-30 15:34:39 -07:00
zio_inject.c Cleanup: Switch to strlcpy from strncpy 2022-09-27 16:35:29 -07:00
zio.c Avoid unnecessary metaslab_check_free calling 2022-10-04 10:55:35 -07:00
zle.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zrlock.c Replace dead opensolaris.org license link 2022-07-11 14:16:13 -07:00
zthr.c Switch from _Noreturn to __attribute__((noreturn)) 2022-03-23 08:51:00 -07:00
zvol.c Fix unchecked return values and unused return values 2022-09-23 16:52:03 -07:00