mirror_zfs/module/zfs
Matthew Ahrens 8542ef852a OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Matt Ahrens <mahrens@delphix.com>

RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity.  Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4.  A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.

To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks.  The contents of these pad sectors are not used, so we do not
need to read or write these sectors.  However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors.  If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance.  The gap-filling performance improvement was introduced in
July 2009.

Writing the optional zio is done by the io aggregation code in
vdev_queue.c.  The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB.  For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.

The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks).  It
cannot occur with the default recordsize=128KB.  If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors).  Therefore
this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only
in rarely-used configurations:
* 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
* 4-wide RAIDZ2 (either recordsize, either ashift)
* 5-wide RAIDZ2 with recordsize=512KB (either ashift)
* 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)
* RAIDZ1 with 4 or 8 disks
* RAIDZ2 with 4, 8, or 10 disks
* RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)
* RAIDZ1 with 7 or 8 disks
* RAIDZ2 with 4, 5, or 10 disks
* RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing
kernel tunables), many configurations are affected, including with
higher numbers of disks (up to 18 disks with recordsize=2MB).

Increase zfs_vdev_aggregation_limit to allow the optional zio to be
aggregated, thus eliminating the problem.  Setting it to 256KB fixes all
commonly-used configurations.

The solution is to aggregate optional zio's regardless of the
aggregation size limit.

Analysis sponsored by Intel Corp.

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321
Closes #5931
2017-04-10 15:21:45 -07:00
..
abd.c Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
arc.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
blkptr.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
bplist.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
bpobj.c panic in bpobj_space(): null pointer dereference 2017-02-09 10:19:12 -08:00
bptree.c OpenZFS 7082 - bptree_iterate() passes wrong args to zfs_dbgmsg() 2017-01-17 14:49:24 -08:00
bqueue.c Fix coverity defects: CID 147565-147567 2016-10-07 13:19:43 -07:00
dbuf_stats.c OpenZFS 6950 - ARC should cache compressed data 2016-09-13 09:58:33 -07:00
dbuf.c OpenZFS 8023 - Panic destroying a metaslab deferred range tree 2017-04-09 16:12:35 -07:00
ddt_zap.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
ddt.c Cache ddt_get_dedup_dspace() value if there was no ddt changes 2016-12-02 16:59:35 -07:00
dmu_diff.c OpenZFS 6950 - ARC should cache compressed data 2016-09-13 09:58:33 -07:00
dmu_object.c Clean up by-dnode code in dmu_tx.c 2017-02-24 13:34:26 -08:00
dmu_objset.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
dmu_send.c OpenZFS 7247 - zfs receive of deduplicated stream fails 2017-02-04 09:10:24 -08:00
dmu_traverse.c Add TASKQID_INVALID 2016-11-02 12:14:45 -07:00
dmu_tx.c OpenZFS 7801 - add more by-dnode routines (lint) 2017-03-20 12:33:17 -07:00
dmu_zfetch.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
dmu.c Add missing module_param for zfs_per_txg_dirty_frees_percent 2017-02-07 09:44:03 -08:00
dnode_sync.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
dnode.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
dsl_bookmark.c OpenZFS 1300 - filename normalization doesn't work for removes 2017-02-02 14:13:41 -08:00
dsl_dataset.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
dsl_deadlist.c panic in bpobj_space(): null pointer dereference 2017-02-09 10:19:12 -08:00
dsl_deleg.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
dsl_destroy.c OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs 2017-01-27 11:43:42 -08:00
dsl_dir.c OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
dsl_pool.c OpenZFS 8027 - tighten up dsl_pool_dirty_delta 2017-04-09 16:04:35 -07:00
dsl_prop.c Fix dsl_props_set_sync_impl to work with nested nvlist 2016-12-20 18:46:59 -08:00
dsl_scan.c OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs 2017-01-27 11:43:42 -08:00
dsl_synctask.c Illumos 4951 - ZFS administrative commands should use reserved space 2015-05-04 09:41:10 -07:00
dsl_userhold.c OpenZFS 6314 - buffer overflow in dsl_dataset_name 2016-06-28 13:47:03 -07:00
edonr_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
fm.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
gzip.c GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
lz4.c Fix spelling 2017-01-03 11:31:18 -06:00
lzjb.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
Makefile.in GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
metaslab.c OpenZFS 8023 - Panic destroying a metaslab deferred range tree 2017-04-09 16:12:35 -07:00
multilist.c OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
pathname.c Add pn_alloc()/pn_free() functions 2016-04-21 09:49:25 -07:00
policy.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
qat_compress.c GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
qat_compress.h GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
range_tree.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
refcount.c Linux 4.11 compat: avoid refcount_t name conflict 2017-02-28 16:10:18 -08:00
rrwlock.c Fix spelling 2017-01-03 11:31:18 -06:00
sa.c OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change 2017-01-26 14:43:28 -08:00
sha256.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
skein_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_config.c Fix spelling 2017-01-03 11:31:18 -06:00
spa_errlog.c Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t 2014-08-06 14:48:41 -07:00
spa_history.c Fix indefinite article 2016-08-11 11:23:49 -07:00
spa_misc.c OpenZFS 8023 - Panic destroying a metaslab deferred range tree 2017-04-09 16:12:35 -07:00
spa_stats.c Fix spelling 2017-01-03 11:31:18 -06:00
spa.c OpenZFS 3821 - Race in rollback, zil close, and zil flush 2017-03-23 18:20:58 -07:00
space_map.c OpenZFS 8023 - Panic destroying a metaslab deferred range tree 2017-04-09 16:12:35 -07:00
space_reftree.c OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
trace.c OpenZFS 6531 - Provide mechanism to artificially limit disk performance 2016-05-26 10:11:51 -07:00
txg.c Refactor txg history kstat 2016-12-02 16:57:49 -07:00
uberblock.c Illumos 5347 - idle pool may run itself out of space 2015-07-14 10:35:21 -07:00
unique.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
vdev_cache.c Fix wrong offset args in vdev_cache_write 2017-03-28 11:06:22 -07:00
vdev_disk.c OpenZFS 7448 - ZFS doesn't notice when disk vdevs have no write cache 2017-02-04 09:23:50 -08:00
vdev_file.c Use a dedicated taskq for vdev_file 2016-12-21 10:47:15 -08:00
vdev_label.c OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
vdev_mirror.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_missing.c Illumos #5244 - zio pipeline callers should explicitly invoke next stage 2015-04-30 15:07:47 -07:00
vdev_queue.c OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations 2017-04-10 15:21:45 -07:00
vdev_raidz_math_aarch64_neon_common.h ABD raidz NEON support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_aarch64_neon.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_aarch64_neonx2.c ABD raidz NEON support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx2.c ABD raidz avx512f support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx512bw.c ABD: Adapt avx512bw raidz assembly 2016-12-15 17:31:33 -08:00
vdev_raidz_math_avx512f.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
vdev_raidz_math_impl.h codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c ABD Vectorized raidz 2016-11-29 14:34:33 -08:00
vdev_raidz_math_sse2.c ABD raidz avx512f support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_ssse3.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz.c Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
vdev_root.c Illumos #3598 2013-10-31 14:58:04 -07:00
vdev.c OpenZFS 7885 - zpool list can report 16.0e for expandsz 2017-04-05 09:33:20 -07:00
zap_leaf.c OpenZFS 1300 - filename normalization doesn't work for removes 2017-02-02 14:13:41 -08:00
zap_micro.c OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
zap.c OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
zfeature_common.c OpenZFS 2932 - support crash dumps to raidz, etc. pools 2017-04-10 10:24:17 -07:00
zfeature.c OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
zfs_acl.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_byteswap.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
zfs_ctldir.c Restructure mount option handling 2017-03-10 09:51:41 -08:00
zfs_debug.c OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's 2017-01-28 12:16:43 -08:00
zfs_dir.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_fm.c Fix regression in zfs_ereport_start() 2017-04-05 14:24:26 -07:00
zfs_fuid.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_ioctl.c Restructure mount option handling 2017-03-10 09:51:41 -08:00
zfs_log.c OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
zfs_onexit.c zfsdev_getminor() should check for invalid file handles 2015-06-22 17:02:13 -07:00
zfs_replay.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_rlock.c Fix spelling 2017-01-03 11:31:18 -06:00
zfs_sa.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_vfsops.c OpenZFS 6874 - rollback and receive need to reset ZPL state to what's on disk 2017-03-13 17:42:42 -07:00
zfs_vnops.c Rename zfs_* functions 2017-03-10 09:51:35 -08:00
zfs_znode.c Retry zfs_znode_alloc() in zfs_mknode() 2017-03-23 18:26:50 -07:00
zil.c OpenZFS 3821 - Race in rollback, zil close, and zil flush 2017-03-23 18:20:58 -07:00
zio_checksum.c Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
zio_compress.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
zio_inject.c Compile zio.h and zio_impl.h mutual include 2016-12-01 16:36:25 -07:00
zio.c Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
zle.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
zpl_ctldir.c Linux 4.11 compat: iops.getattr and friends 2017-03-20 17:51:16 -07:00
zpl_export.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
zpl_file.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zpl_inode.c Linux 4.11 compat: iops.getattr and friends 2017-03-20 17:51:16 -07:00
zpl_super.c Restructure mount option handling 2017-03-10 09:51:41 -08:00
zpl_xattr.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zrlock.c OpenZFS 3746 - ZRLs are racy 2017-01-23 10:35:58 -08:00
zvol.c Fix ZVOL BLKFLSBUF ioctl 2017-03-09 17:43:36 -08:00