mirror_zfs/module/zfs
Etienne Dechamps 920dd524fb Add FASTWRITE algorithm for synchronous writes.
Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks across vdevs is important for performance: indeed, using
multiple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:

1) It would be best if the ZIL module were not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;

2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;

3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.

The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.

This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
allocator is called with the FASTWRITE flag, it will choose the vdev with
the fewest pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.
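
The selection rule described above can be sketched as follows. This is
only an illustration under stated assumptions, not the code added by
this commit: sketch_vdev_t, sketch_fastwrite_pick() and the flat array
of vdevs are hypothetical stand-ins, and the counter is modelled as a
plain count of pending writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a top-level vdev; not the real vdev_t. */
typedef struct sketch_vdev {
	uint64_t pending_fastwrite;	/* allocated but not yet completed */
} sketch_vdev_t;

/*
 * Choose the vdev with the fewest pending fastwrites. The scan starts at
 * the rotor and only a strictly smaller counter replaces the current
 * choice, so ties go to the first matching vdev after the rotor position.
 * The chosen vdev's counter is then bumped, which "reserves" it.
 */
static size_t
sketch_fastwrite_pick(sketch_vdev_t *vdevs, size_t nvdevs, size_t rotor)
{
	size_t best, i;

	assert(nvdevs > 0);
	best = rotor % nvdevs;

	for (i = 1; i < nvdevs; i++) {
		size_t cand = (rotor + i) % nvdevs;

		if (vdevs[cand].pending_fastwrite <
		    vdevs[best].pending_fastwrite)
			best = cand;
	}

	vdevs[best].pending_fastwrite++;
	return (best);
}
```

With three idle vdevs and the rotor fixed at index 1, successive calls
in this sketch pick vdevs 1, 2 and 0 before returning to 1, because each
call avoids the vdevs that still have an uncompleted write.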

The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.

metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implicitly by
metaslab_alloc().
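
To make the leak and double-unmark hazard concrete, here is a minimal
sketch of a per-vdev counter with manual mark and unmark. The names
fw_vdev_t, fw_mark(), fw_unmark() and fw_teardown_ok() are hypothetical
and are not the signatures of metaslab_fastwrite_mark() or
metaslab_fastwrite_unmark(); the teardown check only mirrors the idea of
the assert mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vdev state; not the real vdev_t. */
typedef struct fw_vdev {
	uint64_t pending_fastwrite;	/* marks not yet balanced by unmarks */
} fw_vdev_t;

/* Manually account for one pending synchronous write on this vdev. */
static void
fw_mark(fw_vdev_t *vd)
{
	vd->pending_fastwrite++;
}

/*
 * Drop one pending write. The assert only catches an unmark that would
 * underflow the counter; without per-block tracking, a double-unmark
 * hidden by another outstanding mark goes unnoticed.
 */
static void
fw_unmark(fw_vdev_t *vd)
{
	assert(vd->pending_fastwrite > 0);
	vd->pending_fastwrite--;
}

/* Teardown is only clean when every mark has been balanced. */
static int
fw_teardown_ok(const fw_vdev_t *vd)
{
	return (vd->pending_fastwrite == 0);
}
```

A mark that is never balanced leaves fw_teardown_ok() returning 0, which
is the kind of leak a teardown assert would expose.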

ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.
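
The flag's life cycle described above (mark at allocation, unmark at
completion) might look roughly like this. Everything here is an
assumption made for illustration: SKETCH_ZIO_FLAG_FASTWRITE and its bit
value, zs_zio_t, zs_zio_allocate() and zs_zio_done() are hypothetical
and do not reproduce the real zio_t or ZIO pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_ZIO_FLAG_FASTWRITE	(1u << 0)	/* hypothetical bit */

/* Hypothetical stand-ins; not the real vdev_t or zio_t. */
typedef struct zs_vdev {
	uint64_t pending_fastwrite;
} zs_vdev_t;

typedef struct zs_zio {
	uint32_t flags;
	zs_vdev_t *vd;		/* vdev chosen at allocation time */
} zs_zio_t;

/* Allocation step: a flagged I/O marks the vdev it was placed on. */
static void
zs_zio_allocate(zs_zio_t *zio, zs_vdev_t *vd)
{
	zio->vd = vd;
	if (zio->flags & SKETCH_ZIO_FLAG_FASTWRITE)
		vd->pending_fastwrite++;
}

/* Completion step: the same flag triggers the matching unmark. */
static void
zs_zio_done(zs_zio_t *zio)
{
	if ((zio->flags & SKETCH_ZIO_FLAG_FASTWRITE) && zio->vd != NULL) {
		assert(zio->vd->pending_fastwrite > 0);
		zio->vd->pending_fastwrite--;
	}
}
```

Tying the unmark to the same flag that caused the mark keeps the counter
balanced for every I/O that goes through both steps.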

A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
lingering fastwrites will get pruned at each sync event.

The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workloads
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
2012-10-17 08:56:41 -07:00
arc.c Remove vmem_size() consumers 2012-10-12 10:03:03 -07:00
bplist.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
bpobj.c Illumos #1644, #1645, #1646, #1647, #1708 2012-07-31 09:25:30 -07:00
dbuf.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
ddt_zap.c Switch KM_SLEEP to KM_PUSHPAGE 2012-10-04 10:44:09 -07:00
ddt.c Switch KM_SLEEP to KM_PUSHPAGE 2012-09-19 11:52:36 -07:00
dmu_diff.c Update to onnv_147 2010-08-26 14:24:34 -07:00
dmu_object.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
dmu_objset.c Switch KM_SLEEP to KM_PUSHPAGE 2012-10-08 10:19:05 -07:00
dmu_send.c Illumos #2703: add mechanism to report ZFS send progress 2012-09-19 13:39:06 -07:00
dmu_traverse.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
dmu_tx.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
dmu_zfetch.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
dmu.c Add FASTWRITE algorithm for synchronous writes. 2012-10-17 08:56:41 -07:00
dnode_sync.c Fix dbuf eviction assertion 2010-08-31 08:38:45 -07:00
dnode.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
dsl_dataset.c Illumos #3100: zvol rename fails with EBUSY when dirty. 2012-10-03 13:59:02 -07:00
dsl_deadlist.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
dsl_deleg.c Illumos #1644, #1645, #1646, #1647, #1708 2012-07-31 09:25:30 -07:00
dsl_dir.c Illumos #3100: zvol rename fails with EBUSY when dirty. 2012-10-03 13:59:02 -07:00
dsl_pool.c txg is spelled as tgx in places 2012-10-11 09:19:08 -07:00
dsl_prop.c Switch KM_SLEEP to KM_PUSHPAGE 2012-09-17 11:22:23 -07:00
dsl_scan.c Fix zfs_txg_timeout module parameter 2012-10-11 15:07:09 -07:00
dsl_synctask.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
fm.c Condition variable usage, zevent_cv 2012-10-15 16:01:54 -07:00
gzip.c Fix zmod.h usage in userspace 2010-08-31 08:38:46 -07:00
lzjb.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
Makefile.in Add script for builtin module building. 2012-07-26 13:45:09 -07:00
metaslab.c Add FASTWRITE algorithm for synchronous writes. 2012-10-17 08:56:41 -07:00
refcount.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
rrwlock.c Enable rrwlock.c compilation 2010-12-07 16:05:25 -08:00
sa.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
sha256.c Add linux sha2 support 2010-08-31 13:41:59 -07:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_config.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
spa_errlog.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_history.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
spa_misc.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
spa.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
space_map.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
txg.c Fix zfs_txg_timeout module parameter 2012-10-11 15:07:09 -07:00
uberblock.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
unique.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
vdev_cache.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
vdev_disk.c Modify vdev_elevator_switch() to use elevator_change() 2012-10-03 13:31:44 -07:00
vdev_file.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
vdev_label.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
vdev_mirror.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
vdev_missing.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
vdev_queue.c Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE 2012-10-15 09:28:43 -07:00
vdev_raidz.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
vdev_root.c Illumos #1948: zpool list should show more detailed pool info 2012-09-19 13:39:05 -07:00
vdev.c Add FASTWRITE algorithm for synchronous writes. 2012-10-17 08:56:41 -07:00
zap_leaf.c Switch KM_SLEEP to KM_PUSHPAGE 2012-09-05 08:44:58 -07:00
zap_micro.c Switch KM_SLEEP to KM_PUSHPAGE 2012-09-05 08:44:58 -07:00
zap.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
zfs_acl.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
zfs_byteswap.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
zfs_ctldir.c Return positive error number in zfsctl_shares_lookup. 2012-10-15 09:11:56 -07:00
zfs_debug.c Use spl_debug_* helpers 2012-02-09 16:37:48 -08:00
zfs_dir.c rmdir(2) should return ENOTEMPTY 2012-08-26 13:55:45 -07:00
zfs_fm.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
zfs_fuid.c Drop HAVE_XVATTR macros 2011-03-02 11:44:34 -08:00
zfs_ioctl.c Illumos #3129, #3130 2012-10-03 13:59:02 -07:00
zfs_log.c Make zfs_immediate_write_sz a module paramater 2012-10-11 11:09:21 -07:00
zfs_onexit.c Add linux kernel device support 2010-08-31 13:41:50 -07:00
zfs_replay.c ZFS replay transaction error 5 2012-09-17 11:06:58 -07:00
zfs_rlock.c Condition variable usage, zp->r_{rd,wr}_cv 2012-10-15 16:02:03 -07:00
zfs_sa.c Revert "Use SA_HDL_PRIVATE for SA xattrs" 2012-08-25 09:25:56 -07:00
zfs_vfsops.c Illumos #3100: zvol rename fails with EBUSY when dirty. 2012-10-03 13:59:02 -07:00
zfs_vnops.c Clear PG_writeback for sync I/O error case 2012-09-14 15:53:47 -07:00
zfs_znode.c Switch KM_SLEEP to KM_PUSHPAGE 2012-08-27 12:01:37 -07:00
zil.c Add FASTWRITE algorithm for synchronous writes. 2012-10-17 08:56:41 -07:00
zio_checksum.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
zio_compress.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
zio_inject.c Add missing ZFS tunables 2011-05-04 10:02:37 -07:00
zio.c Add FASTWRITE algorithm for synchronous writes. 2012-10-17 08:56:41 -07:00
zle.c Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
zpl_ctldir.c Linux 3.6 compat, iops->lookup() 2012-10-14 13:06:54 -07:00
zpl_export.c Implement .commit_metadata hook for NFS export 2012-10-03 10:49:45 -07:00
zpl_file.c Annotate KM_PUSHPAGE call paths with PF_NOFS 2012-08-27 12:01:37 -07:00
zpl_inode.c Linux 3.6 compat, iops->create() 2012-10-14 14:42:25 -07:00
zpl_super.c Linux 3.6 compat, sops->write_super() removed 2012-10-14 11:33:56 -07:00
zpl_xattr.c Add missing NULL in zpl_xattr_handlers 2012-03-15 15:18:29 -07:00
zrlock.c Export ZFS symbols needed by Lustre. 2010-09-17 16:24:15 -07:00
zvol.c Set default zvol elevator to noop 2012-10-05 12:39:59 -07:00