mirror_zfs/module/zfs
Serapheim Dimitropoulos 93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
..
abd.c single-chunk scatter ABDs can be treated as linear 2019-06-11 09:02:31 -07:00
aggsum.c OpenZFS 9688 - aggsum_fini leaks memory 2018-10-19 12:08:03 -07:00
arc.c Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional 2019-07-03 13:01:54 -07:00
blkptr.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
bplist.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
bpobj.c Stack overflow in recursive bpobj_iterate_impl 2019-03-06 09:50:55 -08:00
bptree.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
bqueue.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
cityhash.c OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
dataset_kstats.c port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
dbuf_stats.c Prefix all refcount functions with zfs_ 2018-10-01 10:42:05 -07:00
dbuf.c Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
ddt_zap.c fat zap should prefetch when iterating 2019-06-12 13:13:09 -07:00
ddt.c Remove dedupditto functionality 2019-06-19 14:54:02 -07:00
dmu_diff.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_object.c Fix send/recv lost spill block 2019-05-07 15:18:44 -07:00
dmu_objset.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
dmu_recv.c Allow unencrypted children of encrypted datasets 2019-06-20 12:29:51 -07:00
dmu_redact.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_send.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_traverse.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dmu_tx.c Remove code for zfs remap 2019-06-24 16:44:01 -07:00
dmu_zfetch.c Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
dmu.c Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
dnode_sync.c Decrease contention on dn_struct_rwlock 2019-07-08 13:18:50 -07:00
dnode.c Export dnode symbols 2019-07-15 16:11:55 -07:00
dsl_bookmark.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_crypt.c Remove VERIFY from dsl_dataset_crypt_stats() 2019-07-05 16:53:14 -07:00
dsl_dataset.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_deadlist.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_deleg.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_destroy.c Ensure dsl_destroy_head() decrypts objsets 2019-07-15 16:08:42 -07:00
dsl_dir.c Remove code for zfs remap 2019-06-24 16:44:01 -07:00
dsl_pool.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
dsl_prop.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_scan.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
dsl_synctask.c OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
dsl_userhold.c zfs should optionally send holds 2019-02-15 12:41:38 -08:00
edonr_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
fm.c OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations 2018-07-30 11:30:03 -07:00
gzip.c Update build system and packaging 2018-05-29 16:00:33 -07:00
hkdf.c Encryption patch follow-up 2017-10-11 16:54:48 -04:00
lz4.c Reword comment in lz4_compress_zfs 2019-05-02 16:46:04 -07:00
lzjb.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
Makefile.in Log Spacemap Project 2019-07-16 10:11:49 -07:00
metaslab.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
mmp.c MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
multilist.c Avoid extra taskq_dispatch() calls by DMU 2019-06-25 12:03:38 -07:00
objlist.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
pathname.c Disable unused pathname::pn_path* (unneeded in Linux) 2019-07-15 13:57:56 -07:00
policy.c Take user namespaces into account in policy checks 2018-03-07 15:40:42 -08:00
qat_compress.c Code improvement and bug fixes for QAT support 2019-04-16 12:38:36 -07:00
qat_crypt.c Code improvement and bug fixes for QAT support 2019-04-16 12:38:36 -07:00
qat.c Code improvement and bug fixes for QAT support 2019-04-16 12:38:36 -07:00
qat.h Code improvement and bug fixes for QAT support 2019-04-16 12:38:36 -07:00
range_tree.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
refcount.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
rrwlock.c 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
sa.c Prefix all refcount functions with zfs_ 2018-10-01 10:42:05 -07:00
sha256.c SHA256 QAT acceleration 2018-03-15 10:53:58 -07:00
skein_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_checkpoint.c Get rid of space_map_update() for ms_synced_length 2019-02-12 10:38:11 -08:00
spa_config.c Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) 2019-05-29 16:18:14 -07:00
spa_errlog.c Update build system and packaging 2018-05-29 16:00:33 -07:00
spa_history.c Create /proc/sys/kernel/spl/gitrev with git hash 2018-10-08 21:57:02 -07:00
spa_log_spacemap.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
spa_misc.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
spa_stats.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
spa.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
space_map.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
space_reftree.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
THIRDPARTYLICENSE.cityhash OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
THIRDPARTYLICENSE.cityhash.descrip OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
trace.c 8659 static dtrace probes unavailable on non-GPL modules 2019-07-08 11:20:53 -07:00
txg.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
uberblock.c MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
unique.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
vdev_cache.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_disk.c Revert "Fail early on bio corruption confirmed on 5.2-rc1" 2019-07-05 20:38:56 -07:00
vdev_file.c Update vdev_ops_t from illumos 2019-06-20 18:29:02 -07:00
vdev_indirect_births.c Fixes: #8934 Large kmem_alloc 2019-07-10 15:54:49 -07:00
vdev_indirect_mapping.c Get rid of space_map_update() for ms_synced_length 2019-02-12 10:38:11 -08:00
vdev_indirect.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
vdev_initialize.c Add TRIM support 2019-03-29 09:13:20 -07:00
vdev_label.c panic in removal_remap test on 4K devices 2019-06-13 13:12:39 -07:00
vdev_mirror.c Update vdev_ops_t from illumos 2019-06-20 18:29:02 -07:00
vdev_missing.c Update vdev_ops_t from illumos 2019-06-20 18:29:02 -07:00
vdev_queue.c Move write aggregation memory copy out of vq_lock 2019-06-13 13:08:24 -07:00
vdev_raidz_math_aarch64_neon_common.h Linux 5.0 compat: ASM_BUG macro 2019-05-08 10:18:40 -07:00
vdev_raidz_math_aarch64_neon.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_aarch64_neonx2.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_avx2.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_avx512bw.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_avx512f.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_impl.h codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c ABD Vectorized raidz 2016-11-29 14:34:33 -08:00
vdev_raidz_math_sse2.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math_ssse3.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz_math.c Linux 5.0 compat: SIMD compatibility 2019-07-12 09:31:20 -07:00
vdev_raidz.c Update vdev_ops_t from illumos 2019-06-20 18:29:02 -07:00
vdev_removal.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
vdev_root.c Update vdev_ops_t from illumos 2019-06-20 18:29:02 -07:00
vdev_trim.c Add TRIM support 2019-03-29 09:13:20 -07:00
vdev.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
zap_leaf.c Off-by-one in zap_leaf_array_create() 2019-01-18 09:58:46 -08:00
zap_micro.c fat zap should prefetch when iterating 2019-06-12 13:13:09 -07:00
zap.c fat zap should prefetch when iterating 2019-06-12 13:13:09 -07:00
zcp_get.c Detect and prevent mixed raw and non-raw sends 2019-03-13 11:00:43 -07:00
zcp_global.c OpenZFS 8600 - ZFS channel programs - snapshot 2018-02-08 15:29:24 -08:00
zcp_iter.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
zcp_synctask.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
zcp.c OpenZFS 9425 - channel programs can be interrupted 2019-06-22 16:51:46 -07:00
zfeature.c Consistently captialize GUID for features 2019-04-16 10:01:51 -07:00
zfs_acl.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_byteswap.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ctldir.c Improve "Unable to automount" error message. 2019-07-03 13:05:02 -07:00
zfs_debug.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
zfs_dir.c port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
zfs_fm.c Add zpool status -s (slow I/Os) and -p (parseable) 2018-11-08 16:47:24 -08:00
zfs_fuid.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ioctl.c Remove code for zfs remap 2019-06-24 16:44:01 -07:00
zfs_log.c make zil max block size tunable 2019-06-10 11:48:42 -07:00
zfs_onexit.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ratelimit.c Change checksum & IO delay ratelimit values 2018-03-04 17:34:51 -08:00
zfs_replay.c Use SEEK_{SET,CUR,END} for file seek "whence" 2019-04-25 10:17:27 -07:00
zfs_rlock.c OpenZFS 9689 - zfs range lock code should not be zpl-specific 2018-10-11 10:19:33 -07:00
zfs_sa.c Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_sysfs.c Prevent pointer to an out-of-scope local variable 2019-06-20 18:31:52 -07:00
zfs_vfsops.c Implement Redacted Send/Receive 2019-06-19 09:48:12 -07:00
zfs_vnops.c Update descriptions for vnops 2019-05-25 14:29:10 -07:00
zfs_znode.c Fix integer overflow of ZTOI(zp)->i_generation 2019-06-06 12:59:39 -07:00
zil.c make zil max block size tunable 2019-06-10 11:48:42 -07:00
zio_checksum.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio_compress.c OpenZFS 9403 - assertion failed in arc_buf_destroy() 2018-08-29 11:33:33 -07:00
zio_crypt.c Always call rw_init in zio_crypt_key_unwrap 2019-04-10 15:39:40 -07:00
zio_inject.c Multiple DVA Scrubbing Fix 2019-03-15 14:14:31 -07:00
zio.c Log Spacemap Project 2019-07-16 10:11:49 -07:00
zle.c Fix zle_decompress out of bound access 2018-02-09 10:08:05 -08:00
zpl_ctldir.c RHEL 7.5 compat: FMODE_KABI_ITERATE 2018-05-02 15:01:24 -07:00
zpl_export.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
zpl_file.c Fix errant EFAULT during writes (#8719) 2019-05-08 10:04:04 -07:00
zpl_inode.c Fix errant EFAULT during writes (#8719) 2019-05-08 10:04:04 -07:00
zpl_super.c Fix statfs(2) for 32-bit user space 2018-09-24 17:11:25 -07:00
zpl_xattr.c Drop redundant POSIX ACL check in zpl_init_acl() 2019-07-15 16:26:52 -07:00
zrlock.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zthr.c Fix txg_wait_open() load average inflation 2019-04-04 09:44:46 -07:00
zvol.c Add SCSI_PASSTHROUGH to zvols to enable UNMAP support 2019-06-21 09:40:56 -07:00