Go to file
Matthew Ahrens 8542ef852a OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Matt Ahrens <mahrens@delphix.com>

RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity.  Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4.  A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.

To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks.  The contents of these pad sectors are not used, so we do not
need to read or write these sectors.  However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors.  If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance.  The gap-filling performance improvement was introduced in
July 2009.

Writing the optional zio is done by the io aggregation code in
vdev_queue.c.  The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB.  For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.

The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks).  It
cannot occur with the default recordsize=128KB.  If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors).  Therefore
this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only
in rarely-used configurations:
* 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
* 4-wide RAIDZ2 (either recordsize, either ashift)
* 5-wide RAIDZ2 with recordsize=512KB (either ashift)
* 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)
* RAIDZ1 with 4 or 8 disks
* RAIDZ2 with 4, 8, or 10 disks
* RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)
* RAIDZ1 with 7 or 8 disks
* RAIDZ2 with 4, 5, or 10 disks
* RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing
kernel tunables), many configurations are affected, including with
higher numbers of disks (up to 18 disks with recordsize=2MB).

Increase zfs_vdev_aggregation_limit to allow the optional zio to be
aggregated, thus eliminating the problem.  Setting it to 256KB fixes all
commonly-used configurations.

The solution is to aggregate optional zio's regardless of the
aggregation size limit.

Analysis sponsored by Intel Corp.

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321
Closes #5931
2017-04-10 15:21:45 -07:00
.github Commit message format in contributing guidelines 2017-03-31 09:33:38 -07:00
cmd Fix coverity defects: CID 161288 2017-04-06 13:18:22 -07:00
config OpenZFS 7290 - ZFS test suite needs to control what utilities it can run 2017-04-06 09:25:36 -07:00
contrib Fix initramfs hook for merged /usr/lib and /lib 2017-02-27 12:03:23 -08:00
etc OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
include OpenZFS 2932 - support crash dumps to raidz, etc. pools 2017-04-10 10:24:17 -07:00
lib OpenZFS 5380 - receive of a send -p stream doesn't need to try renaming snapshots 2017-04-09 16:09:16 -07:00
man OpenZFS 2932 - support crash dumps to raidz, etc. pools 2017-04-10 10:24:17 -07:00
module OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations 2017-04-10 15:21:45 -07:00
rpm Fix powerpc build 2017-03-06 09:17:24 -08:00
scripts Correct shellcheck make recipe 2017-04-06 17:16:41 -07:00
tests OpenZFS 2932 - support crash dumps to raidz, etc. pools 2017-04-10 10:24:17 -07:00
udev Fix spelling 2017-01-03 11:31:18 -06:00
.gitignore OpenZFS 7290 - ZFS test suite needs to control what utilities it can run 2017-04-06 09:25:36 -07:00
.gitmodules Add zimport.sh compatibility test script 2014-02-21 12:10:31 -08:00
AUTHORS Add a missing > to AUTHORS 2014-09-02 14:18:53 -07:00
autogen.sh build: do not call boilerplate ourself 2013-04-02 10:55:20 -07:00
configure.ac OpenZFS 7290 - ZFS test suite needs to control what utilities it can run 2017-04-06 09:25:36 -07:00
copy-builtin Allow c99 when building ZFS in the kernel tree 2017-03-27 12:31:15 -07:00
COPYRIGHT Update ZED copyright boilerplate 2015-05-11 15:07:00 -07:00
DISCLAIMER Fix minor typos and update marketing copy. 2013-03-21 12:51:06 -07:00
Makefile.am Correct shellcheck make recipe 2017-04-06 17:16:41 -07:00
META Tag 0.7.0-rc3 2017-01-20 10:18:28 -08:00
OPENSOLARIS.LICENSE Add CDDL license file 2008-12-01 14:49:34 -08:00
README.markdown Add CONTRIBUTING information and templates 2016-12-09 12:48:12 -07:00
TEST Skip xfstests on Amazon Linux 2017-04-06 17:15:30 -07:00
zfs-script-config.sh.in Add auto-online test for ZED/FMA as part of the ZTS 2017-02-28 16:25:39 -08:00
zfs.release.in Move zfs.release generation to configure step 2012-07-12 12:22:51 -07:00

ZFS is an advanced file system and volume manager which was originally developed for Solaris and is now maintained by the Illumos community.

ZFS on Linux, which is also known as ZoL, is currently feature complete. It includes fully functional and stable SPA, DMU, ZVOL, and ZPL layers. And it's native!

Official Resources

Installation

Full documentation for installing ZoL on your favorite Linux distribution can be found at our site.

Contribute & Develop

We have a separate document with contribution guidelines.