mirror_zfs/module/zfs
Paul Dagnelie 492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
..
abd.c Update build system and packaging 2018-05-29 16:00:33 -07:00
aggsum.c Fix preemptible warning in aggsum_add() 2018-06-07 15:55:11 -07:00
arc.c OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system 2018-07-30 11:30:41 -07:00
blkptr.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
bplist.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
bpobj.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
bptree.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
bqueue.c Call cv_signal() with mutex held 2017-06-26 14:36:49 -07:00
cityhash.c OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
dbuf_stats.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dbuf.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
ddt_zap.c Update build system and packaging 2018-05-29 16:00:33 -07:00
ddt.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dmu_diff.c Fix issues found with zfs diff 2018-05-01 11:24:20 -07:00
dmu_object.c OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times 2018-07-30 09:27:49 -07:00
dmu_objset.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
dmu_send.c Fix 'zfs recv' of non large_dnode send streams 2018-06-28 14:55:11 -07:00
dmu_traverse.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dmu_tx.c Introduce kstat dmu_tx_dirty_frees_delay 2018-07-25 09:52:27 -07:00
dmu_zfetch.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dmu.c Introduce kstat dmu_tx_dirty_frees_delay 2018-07-25 09:52:27 -07:00
dnode_sync.c OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block 2018-07-30 09:28:09 -07:00
dnode.c OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block 2018-07-30 09:28:09 -07:00
dsl_bookmark.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dsl_crypt.c Add support for decryption faults in zinject 2018-05-02 15:36:20 -07:00
dsl_dataset.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_deadlist.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_deleg.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_destroy.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_dir.c OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system 2018-07-30 11:30:41 -07:00
dsl_pool.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_prop.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_scan.c dsl_scan_scrub_cb: don't double-account non-embedded blocks 2018-07-24 09:33:56 -07:00
dsl_synctask.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_userhold.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
edonr_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
fm.c OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations 2018-07-30 11:30:03 -07:00
gzip.c Update build system and packaging 2018-05-29 16:00:33 -07:00
hkdf.c Encryption patch follow-up 2017-10-11 16:54:48 -04:00
lz4.c Fix LZ4_uncompress_unknownOutputSize caused panic 2017-05-19 13:45:46 -07:00
lzjb.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
Makefile.in OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
metaslab.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
mmp.c Update build system and packaging 2018-05-29 16:00:33 -07:00
multilist.c Update build system and packaging 2018-05-29 16:00:33 -07:00
pathname.c Update build system and packaging 2018-05-29 16:00:33 -07:00
policy.c Take user namespaces into account in policy checks 2018-03-07 15:40:42 -08:00
qat_compress.c Fix inst_num overflow in qat_crypt.c 2018-05-01 20:44:24 -07:00
qat_crypt.c Fix inst_num overflow in qat_crypt.c 2018-05-01 20:44:24 -07:00
qat.c SHA256 QAT acceleration 2018-03-15 10:53:58 -07:00
qat.h Resolve QAT issues with incompressible data 2018-03-29 17:40:34 -07:00
range_tree.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
refcount.c Linux 4.11 compat: avoid refcount_t name conflict 2017-02-28 16:10:18 -08:00
rrwlock.c Fix spelling 2017-01-03 11:31:18 -06:00
sa.c Fix kernel unaligned access on sparc64 2018-07-11 13:10:40 -07:00
sha256.c SHA256 QAT acceleration 2018-03-15 10:53:58 -07:00
skein_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_checkpoint.c OpenZFS 9238 - ZFS Spacemap Encoding V2 2018-07-05 12:02:34 -07:00
spa_config.c OpenZFS 9591 - ms_shift can be incorrectly changed 2018-06-21 09:35:26 -07:00
spa_errlog.c Update build system and packaging 2018-05-29 16:00:33 -07:00
spa_history.c Update build system and packaging 2018-05-29 16:00:33 -07:00
spa_misc.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
spa_stats.c Add pool state /proc entry, "SUSPENDED" pools 2018-06-06 09:33:54 -07:00
spa.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
space_map.c OpenZFS 9442 - decrease indirect block size of spacemaps 2018-07-25 14:11:35 -07:00
space_reftree.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
THIRDPARTYLICENSE.cityhash OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
THIRDPARTYLICENSE.cityhash.descrip OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
trace.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
txg.c OpenZFS 9464 - txg_kick() fails to see that we are quiescing 2018-06-04 14:56:06 -07:00
uberblock.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
unique.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
vdev_cache.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_disk.c Add support for autoexpand property 2018-07-23 15:40:15 -07:00
vdev_file.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_indirect_births.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_indirect_mapping.c OpenZFS 9238 - ZFS Spacemap Encoding V2 2018-07-05 12:02:34 -07:00
vdev_indirect.c OpenZFS 9238 - ZFS Spacemap Encoding V2 2018-07-05 12:02:34 -07:00
vdev_label.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
vdev_mirror.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_missing.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_queue.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
vdev_raidz_math_aarch64_neon_common.h ABD raidz NEON support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_aarch64_neon.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_aarch64_neonx2.c ABD raidz NEON support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx2.c ABD raidz avx512f support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx512bw.c ABD: Adapt avx512bw raidz assembly 2016-12-15 17:31:33 -08:00
vdev_raidz_math_avx512f.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
vdev_raidz_math_impl.h codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c ABD Vectorized raidz 2016-11-29 14:34:33 -08:00
vdev_raidz_math_sse2.c ABD raidz avx512f support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_ssse3.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_raidz.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_removal.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
vdev_root.c OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery 2018-05-08 21:35:27 -07:00
vdev.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
zap_leaf.c OpenZFS 9328 - zap code can take advantage of c99 2018-05-31 10:53:11 -07:00
zap_micro.c OpenZFS 9329 - panic in zap_leaf_lookup() due to concurrent zapification 2018-05-31 10:53:49 -07:00
zap.c OpenZFS 9328 - zap code can take advantage of c99 2018-05-31 10:53:11 -07:00
zcp_get.c Fix coverity defects: zfs channel programs 2018-02-20 11:19:42 -08:00
zcp_global.c OpenZFS 8600 - ZFS channel programs - snapshot 2018-02-08 15:29:24 -08:00
zcp_iter.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
zcp_synctask.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
zcp.c OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')" 2018-07-10 21:29:23 -07:00
zfeature.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zfs_acl.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_byteswap.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ctldir.c Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zfs_debug.c enable zfs_dbgmsg() by default, without dprintf() 2018-03-21 15:37:32 -07:00
zfs_dir.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_fm.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_fuid.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ioctl.c OpenZFS 8906 - uts: illumos rootfs should support salted cksum 2018-07-27 08:35:28 -07:00
zfs_log.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_onexit.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ratelimit.c Change checksum & IO delay ratelimit values 2018-03-04 17:34:51 -08:00
zfs_replay.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_rlock.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_sa.c Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_vfsops.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
zfs_vnops.c Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zfs_znode.c Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zil.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
zio_checksum.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio_compress.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zio_crypt.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zio_inject.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zio.c OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
zle.c Fix zle_decompress out of bound access 2018-02-09 10:08:05 -08:00
zpl_ctldir.c RHEL 7.5 compat: FMODE_KABI_ITERATE 2018-05-02 15:01:24 -07:00
zpl_export.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
zpl_file.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zpl_inode.c Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zpl_super.c Fix zpl_mount() deadlock 2018-07-11 15:49:10 -07:00
zpl_xattr.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zrlock.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zthr.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
zvol.c Linux compat 4.18: check_disk_size_change() 2018-06-15 15:05:21 -07:00