mirror_zfs/include/sys
Paul Dagnelie 492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
..
crypto OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
fm Update build system and packaging 2018-05-29 16:00:33 -07:00
fs Enforce PROP_ONETIME on zpool properties 2018-06-28 14:49:17 -07:00
lua Fix coverity defects: zfs channel programs 2018-02-20 11:19:42 -08:00
sysevent OpenZFS 8959 - Add notifications when a scrub is paused or resumed 2018-01-17 10:31:00 -08:00
abd.h Update build system and packaging 2018-05-29 16:00:33 -07:00
aggsum.h OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
arc_impl.h Add support for decryption faults in zinject 2018-05-02 15:36:20 -07:00
arc.h OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system 2018-07-30 11:30:41 -07:00
avl_impl.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
avl.h Remove dead code from AVL tree 2017-10-05 19:28:00 -07:00
blkptr.h OpenZFS 8067 - zdb should be able to dump literal embedded block pointer 2017-07-07 11:28:01 -07:00
bplist.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
bpobj.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
bptree.h Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t 2014-08-06 14:48:41 -07:00
bqueue.h Illumos 5960, 5925 2016-01-08 15:08:19 -08:00
cityhash.h OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
dbuf.h OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
ddt.h Incorrect maximum DVA value in DDE_GET_NDVAS() 2018-02-26 14:20:12 -08:00
dmu_impl.h Fix race in dnode_check_slots_free() 2018-04-10 11:15:05 -07:00
dmu_objset.h OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
dmu_send.h Raw receive should change key atomically 2018-02-21 12:31:03 -08:00
dmu_traverse.h Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
dmu_tx.h Introduce kstat dmu_tx_dirty_frees_delay 2018-07-25 09:52:27 -07:00
dmu_zfetch.h OpenZFS 6322 - ZFS indirect block predictive prefetch 2016-08-30 14:26:55 -07:00
dmu.h OpenZFS 9442 - decrease indirect block size of spacemaps 2018-07-25 14:11:35 -07:00
dnode.h OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
dsl_bookmark.h Illumos 4368, 4369. 2014-07-29 10:55:29 -07:00
dsl_crypt.h Add support for decryption faults in zinject 2018-05-02 15:36:20 -07:00
dsl_dataset.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_deadlist.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_deleg.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_destroy.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
dsl_dir.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_pool.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_prop.h Illumos 6171 - dsl_prop_unregister() slows down dataset eviction. 2016-01-12 10:53:12 -08:00
dsl_scan.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_synctask.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
dsl_userhold.h Illumos #3740 2013-11-04 11:17:48 -08:00
edonr.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
efi_partition.h Fix spelling 2017-01-03 11:31:18 -06:00
frame.h Suppress incorrect objtool warnings 2017-12-07 10:28:50 -08:00
hkdf.h Encryption patch follow-up 2017-10-11 16:54:48 -04:00
Makefile.am OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
metaslab_impl.h OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
metaslab.h OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
mmp.h Record skipped MMP writes in multihost_history 2018-03-06 15:15:15 -08:00
mntent.h Make zfs mount according to relatime config in dataset 2016-04-05 18:55:59 -07:00
multilist.h OpenZFS 7968 - multi-threaded spa_sync() 2017-03-20 18:36:00 -07:00
note.h Update build system and packaging 2018-05-29 16:00:33 -07:00
nvpair_impl.h OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations 2018-07-30 11:30:03 -07:00
nvpair.h OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations 2018-07-30 11:30:03 -07:00
pathname.h Add pn_alloc()/pn_free() functions 2016-04-21 09:49:25 -07:00
policy.h Add zfs allow and zfs unallow support 2016-06-07 09:16:52 -07:00
range_tree.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
refcount.h OpenZFS 8081 - Compiler warnings in zdb 2017-10-27 12:46:35 -07:00
rrwlock.h Illumos 5008 - lock contention (rrw_exit) while running a read only load 2015-07-06 09:34:13 -07:00
sa_impl.h Implement large_dnode pool feature 2016-06-24 13:13:21 -07:00
sa.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
sdt.h Add line info and SET_ERROR() to ZFS debug log 2017-07-25 23:09:48 -07:00
sha2.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
skein.h OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R 2016-10-03 14:51:15 -07:00
spa_boot.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
spa_checkpoint.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
spa_checksum.h Implementation of AVX2 optimized Fletcher-4 2016-06-02 14:30:51 -07:00
spa_impl.h OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
spa.h OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system 2018-07-30 11:30:41 -07:00
space_map.h OpenZFS 9238 - ZFS Spacemap Encoding V2 2018-07-05 12:02:34 -07:00
space_reftree.h Illumos #4101, #4102, #4103, #4105, #4106 2014-07-22 09:39:16 -07:00
sysevent.h OpenZFS 6939 - add sysevents to zfs core for commands 2017-07-12 21:28:13 -07:00
trace_acl.h Linux 4.16 compat: inode_set_iversion() 2018-02-08 21:25:19 -08:00
trace_arc.h Support re-prioritizing asynchronous prefetches 2017-12-21 09:13:06 -08:00
trace_common.h OpenZFS 6531 - Provide mechanism to artificially limit disk performance 2016-05-26 10:11:51 -07:00
trace_dbgmsg.h Add line info and SET_ERROR() to ZFS debug log 2017-07-25 23:09:48 -07:00
trace_dbuf.h Crash in dbuf_evict_one with DTRACE_PROBE 2017-08-09 11:04:41 -07:00
trace_dmu.h tx_waited -> tx_dirty_delayed in trace_dmu.h 2018-01-31 16:13:26 -08:00
trace_dnode.h Fix build-it compilation regression 2017-01-24 08:50:15 -08:00
trace_multilist.h Fix build-it compilation regression 2017-01-24 08:50:15 -08:00
trace_txg.h Fix build-it compilation regression 2017-01-24 08:50:15 -08:00
trace_vdev.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
trace_zil.h OpenZFS 8585 - improve batching done in zil_commit() 2017-12-05 09:39:16 -08:00
trace_zio.h Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
trace_zrlock.h Fix race in trace point in zrl_add_impl 2018-03-12 11:27:02 -07:00
trace.h Remove duplicate typedefs from trace.h 2015-01-06 16:53:24 -08:00
txg_impl.h OpenZFS 9464 - txg_kick() fails to see that we are quiescing 2018-06-04 14:56:06 -07:00
txg.h OpenZFS 8063 - verify that we do not attempt to access inactive txg 2017-05-10 13:52:22 -04:00
u8_textprep_data.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
u8_textprep.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
uberblock_impl.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
uberblock.h Multi-modifier protection (MMP) 2017-07-13 13:54:00 -04:00
uio_impl.h Add basic uio support 2011-02-10 09:21:43 -08:00
unique.h Illumos #3742 2013-11-04 10:55:25 -08:00
uuid.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
vdev_disk.h Add support for autoexpand property 2018-07-23 15:40:15 -07:00
vdev_file.h Use a dedicated taskq for vdev_file 2016-12-21 10:47:15 -08:00
vdev_impl.h OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
vdev_indirect_births.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_indirect_mapping.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
vdev_raidz_impl.h Revert raidz_map and _col structure types 2018-01-09 14:46:52 -08:00
vdev_raidz.h Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
vdev_removal.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
vdev.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
xvattr.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zap_impl.h OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space 2017-03-07 09:51:59 -08:00
zap_leaf.h Fix ENOSPC in "Handle zap_add() failures in ..." 2018-04-18 14:19:50 -07:00
zap.h OpenZFS 1300 - filename normalization doesn't work for removes 2017-02-02 14:13:41 -08:00
zcp_global.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zcp_iter.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zcp_prop.h OpenZFS 7431 - ZFS Channel Programs 2018-02-08 15:28:18 -08:00
zcp.h Add tunables for channel programs 2018-06-15 15:10:42 -07:00
zfeature.h Revert "zhack: Add 'feature disable' command" 2016-05-17 11:52:07 -07:00
zfs_acl.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_context.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zfs_ctldir.h Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_debug.h OpenZFS 9236 - nuke spa_dbgmsg 2018-04-30 10:19:48 -07:00
zfs_delay.h Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_dir.h Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_fuid.h Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ioctl.h OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
zfs_onexit.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
zfs_project.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_ratelimit.h Change checksum & IO delay ratelimit values 2018-03-04 17:34:51 -08:00
zfs_rlock.h Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_sa.h Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_stat.h Support custom build directories and move includes 2010-09-08 12:38:56 -07:00
zfs_vfsops.h Fix zpl_mount() deadlock 2018-07-11 15:49:10 -07:00
zfs_vnops.h RHEL 7.5 compat: FMODE_KABI_ITERATE 2018-05-02 15:01:24 -07:00
zfs_znode.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zil_impl.h OpenZFS 8909 - 8585 can cause a use-after-free kernel panic 2017-12-28 10:18:04 -08:00
zil.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
zio_checksum.h Remove dependency on linear ABD 2017-03-29 12:24:51 -07:00
zio_compress.h DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
zio_crypt.h Add support for decryption faults in zinject 2018-05-02 15:36:20 -07:00
zio_impl.h Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
zio_priority.h OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
zio.h OpenZFS 9112 - Improve allocation performance on high-end systems 2018-07-31 10:52:33 -07:00
zpl.h Linux 4.18 compat: inode timespec -> timespec64 2018-06-19 21:51:18 -07:00
zrlock.h OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
zthr.h OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
zvol.h Add port of FreeBSD 'volmode' property 2017-07-12 13:05:37 -07:00