mirror_zfs/cmd
Matthew Ahrens aa755b3549
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:

1. ZIL blocks are allocated, written to disk, and then freed a few
seconds later.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation and in turn hurts
performance.

2. Under moderate load, ZIL allocations are 128KB each.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
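
The fallback order above can be illustrated with a minimal sketch.  It is
not the actual OpenZFS code: sketch_class_t, sketch_class_alloc() and
sketch_zil_alloc() are hypothetical names invented for this example, and
each class is modeled only by the size of its largest free segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an allocation class (not metaslab_class_t). */
typedef struct sketch_class {
	const char *name;
	uint64_t largest_free;	/* largest contiguous free segment, bytes */
} sketch_class_t;

/* Hypothetical: succeeds only if a free segment of `size` bytes exists. */
static bool
sketch_class_alloc(sketch_class_t *mc, uint64_t size)
{
	if (mc == NULL || mc->largest_free < size)
		return (false);
	mc->largest_free -= size;
	return (true);
}

/*
 * Try the slog class, then the embedded log class, then the normal class.
 * Returns the name of the class that satisfied the request, or NULL if
 * none could (the case where the ZIL falls back to txg_wait_synced()).
 */
static const char *
sketch_zil_alloc(sketch_class_t *slog, sketch_class_t *embedded,
    sketch_class_t *normal, uint64_t size)
{
	sketch_class_t *order[] = { slog, embedded, normal };

	assert(size > 0);
	for (size_t i = 0; i < sizeof (order) / sizeof (order[0]); i++) {
		if (sketch_class_alloc(order[i], size))
			return (order[i]->name);
	}
	return (NULL);
}
```

Passing NULL for the slog argument models a pool without a dedicated log
device, in which case the embedded slog metaslabs are tried first.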

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.
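
As a rough illustration of the space accounting, the sketch below applies
the percentages quoted above to a pool size.  It takes those figures at
face value and ignores any minimums or caps the real code may apply;
slop_estimate_t and sketch_slop_estimate() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical result type: all values are in bytes. */
typedef struct slop_estimate {
	uint64_t slop_bytes;		/* 3.2% of the pool */
	uint64_t embedded_low_bytes;	/* 0.5% of the pool */
	uint64_t embedded_high_bytes;	/* 1.0% of the pool */
} slop_estimate_t;

/*
 * Estimate slop space and the embedded slog range for a pool.
 * Dividing by 1000 first avoids 64-bit overflow on very large pools,
 * at the cost of discarding the remainder below 0.1% of the pool.
 */
static slop_estimate_t
sketch_slop_estimate(uint64_t pool_bytes)
{
	slop_estimate_t e;
	uint64_t tenth_percent = pool_bytes / 1000;

	e.slop_bytes = tenth_percent * 32;
	e.embedded_low_bytes = tenth_percent * 5;
	e.embedded_high_bytes = tenth_percent * 10;
	return (e);
}
```

Under these assumptions the embedded slog metaslabs use at most about a
third of the slop space, so user-visible capacity is unchanged.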

On an all-SSD system with 4TB of storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-21 15:12:54 -08:00
Name             Latest commit                                             Date
arc_summary      arc_summary3: Handle overflowing value width              2020-12-11 10:29:53 -08:00
arcstat          FreeBSD: Update usage of py-sysctl                        2020-12-10 15:28:31 -08:00
dbufstat         dbufstat: Fix warnings with Python 3.8                    2020-12-23 15:10:35 -08:00
fsck_zfs         Fix typos in cmd/                                         2019-08-30 09:43:30 -07:00
mount_zfs        Re-apply path sanitizer, as mount(8) still mangles it     2021-01-19 11:57:31 -08:00
raidz_test       allow callers to allocate and provide the abd_t struct    2021-01-20 11:24:37 -08:00
vdev_id          Silence 'make checkbashisms'                              2020-08-20 13:45:47 -07:00
zdb              Set aside a metaslab for ZIL blocks                       2021-01-21 15:12:54 -08:00
zed              ZED/zfs-list-cacher.sh: don't exit on ignored event type  2020-12-18 09:34:10 -08:00
zfs              Use the correct return type for getopt                    2020-12-17 10:19:30 -08:00
zfs_ids_to_path  Use the correct return type for getopt                    2020-12-17 10:19:30 -08:00
zgenhostid       Install zgenhostid to sbindir                             2021-01-21 12:58:24 -08:00
zhack            nvlist leaked in zpool_find_config()                      2020-12-28 10:05:31 -08:00
zinject          Use abs_top_builddir when referencing libraries           2020-07-10 14:26:32 -07:00
zpool            record ioctl elapsed time in zpool history                2021-01-11 09:29:25 -08:00
zpool_influxdb   zpool_influxdb: move to libexec dir                       2020-11-28 11:15:57 -08:00
zstream          Use the correct return type for getopt                    2020-12-17 10:19:30 -08:00
zstreamdump      Minor zstream redup command fixes                         2020-04-10 21:10:09 -07:00
ztest            ztest: Clean up use of ASSERT and VERIFY                  2021-01-12 17:21:01 -08:00
zvol_id          Replace ZFS on Linux references with OpenZFS              2020-10-08 20:10:13 -07:00
zvol_wait        zvol_wait should ignore redacted zvols                    2019-11-06 10:51:19 -08:00
Makefile.am      Add zpool_influxdb command                                2020-10-09 09:29:21 -07:00