ddt: dedup log

Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, it is added to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go to the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.
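
The lookup order can be modelled in a few lines of shell. This is only a
sketch: each store is stood in for by a directory of files named by key,
and the function name ddt_lookup_sketch and the log:/zap: output prefixes
are made up for illustration. The real code works on in-memory trees and
ZAP objects.

```shell
#!/bin/sh
# Hypothetical model: a "store" is a directory, an entry is a file named
# by its key. Neither the layout nor the names come from the commit.
ddt_lookup_sketch() {
    # $1 = log store dir, $2 = ZAP store dir, $3 = key
    if [ -f "$1/$3" ]; then
        # Found in the log: no ZAP access needed.
        printf 'log:%s\n' "$(cat "$1/$3")"
    elif [ -f "$2/$3" ]; then
        # Not logged: fall back to the (more expensive) ZAP.
        printf 'zap:%s\n' "$(cat "$2/$3")"
    else
        # Unknown key in both stores.
        return 1
    fi
}
```

An entry present in both stores is answered from the log, which is the
point of the design: the newer, logged copy wins and the ZAP is not read.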

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There are actually two separate "logs" (in-memory tree and
on-disk object), one active (receiving updated entries) and one flushing
(writing out to disk). These are swapped (i.e. flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.
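
As a rough illustration of the swap trigger, the check might look like the
sketch below. The commit message names the two inputs but not the exact
rule, so the function name, both thresholds, measuring time in txgs, and
the "either condition is enough" logic are all assumptions.

```shell
#!/bin/sh
# Hypothetical swap decision. The inputs (log tree memory use, time
# since the last flush) come from the commit message; the thresholds and
# the way they are combined are assumptions made for this sketch.
should_swap_logs_sketch() {
    # $1 = bytes used by the log trees   $2 = memory threshold (bytes)
    # $3 = txgs since the last flush     $4 = txg threshold
    [ "$1" -ge "$2" ] || [ "$3" -ge "$4" ]
}
```

With a txg threshold of 1 the second condition is true every txg, which
matches how the tests in this commit use DEDUP_LOG_TXG_MAX to force a
flush every txg.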

The flushing facility monitors the number of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tunables are provided to control the flushing facility.
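
Only one of these tunables is visible in the test changes of this commit:
DEDUP_LOG_TXG_MAX, which tunables.cfg maps to zfs_dedup_log_txg_max on
Linux. A minimal sketch of setting it outside the test suite follows. The
helper name set_zfs_param_sketch is made up, and the default directory is
the usual Linux sysfs location for zfs module parameters.

```shell
#!/bin/sh
# Write a value to a ZFS module parameter file.
# $1 = parameter name, $2 = value,
# $3 = parameter directory (optional; defaults to the Linux sysfs path)
# The helper itself is hypothetical and not part of the commit.
set_zfs_param_sketch() {
    zps_dir="${3:-/sys/module/zfs/parameters}"
    if [ ! -w "$zps_dir/$1" ]; then
        echo "cannot write $zps_dir/$1" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$zps_dir/$1"
}

# Example (as root, on a Linux host with the zfs module loaded):
#   set_zfs_param_sketch zfs_dedup_log_txg_max 1
```

Setting the value to 1 asks for a log flush every txg, which is what the
updated tests do to make on-disk state predictable.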

All the histograms and stats are updated to accommodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
Rob Norris
2023-06-22 17:46:22 +10:00
committed by Brian Behlendorf
parent cbb9ef0a4c
commit cd69ba3d49
17 changed files with 1621 additions and 131 deletions
@@ -31,6 +31,7 @@ DBUF_CACHE_SHIFT dbuf.cache_shift dbuf_cache_shift
DDT_ZAP_DEFAULT_BS dedup.ddt_zap_default_bs ddt_zap_default_bs
DDT_ZAP_DEFAULT_IBS dedup.ddt_zap_default_ibs ddt_zap_default_ibs
DDT_DATA_IS_SPECIAL ddt_data_is_special zfs_ddt_data_is_special
DEDUP_LOG_TXG_MAX dedup.log_txg_max zfs_dedup_log_txg_max
DEADMAN_CHECKTIME_MS deadman.checktime_ms zfs_deadman_checktime_ms
DEADMAN_EVENTS_PER_SECOND deadman_events_per_second zfs_deadman_events_per_second
DEADMAN_FAILMODE deadman.failmode zfs_deadman_failmode
@@ -29,9 +29,16 @@
log_assert "basic dedup (FDT) operations work"
# we set the dedup log txg interval to 1, to get a log flush every txg,
# effectively disabling the log. without this it's hard to predict when and
# where things appear on-disk
log_must save_tunable DEDUP_LOG_TXG_MAX
log_must set_tunable32 DEDUP_LOG_TXG_MAX 1
function cleanup
{
destroy_pool $TESTPOOL
log_must restore_tunable DEDUP_LOG_TXG_MAX
}
log_onexit cleanup
@@ -29,9 +29,16 @@
log_assert "dedup (FDT) retains version after import"
# we set the dedup log txg interval to 1, to get a log flush every txg,
# effectively disabling the log. without this it's hard to predict when and
# where things appear on-disk
log_must save_tunable DEDUP_LOG_TXG_MAX
log_must set_tunable32 DEDUP_LOG_TXG_MAX 1
function cleanup
{
destroy_pool $TESTPOOL
log_must restore_tunable DEDUP_LOG_TXG_MAX
}
log_onexit cleanup
@@ -30,9 +30,16 @@
log_assert "legacy and FDT dedup tables on the same pool can happily coexist"
# we set the dedup log txg interval to 1, to get a log flush every txg,
# effectively disabling the log. without this it's hard to predict when and
# where things appear on-disk
log_must save_tunable DEDUP_LOG_TXG_MAX
log_must set_tunable32 DEDUP_LOG_TXG_MAX 1
function cleanup
{
destroy_pool $TESTPOOL
log_must restore_tunable DEDUP_LOG_TXG_MAX
}
log_onexit cleanup
@@ -30,9 +30,16 @@
log_assert "legacy dedup tables work after upgrade; new dedup tables created as FDT"
# we set the dedup log txg interval to 1, to get a log flush every txg,
# effectively disabling the log. without this it's hard to predict when and
# where things appear on-disk
log_must save_tunable DEDUP_LOG_TXG_MAX
log_must set_tunable32 DEDUP_LOG_TXG_MAX 1
function cleanup
{
destroy_pool $TESTPOOL
log_must restore_tunable DEDUP_LOG_TXG_MAX
}
log_onexit cleanup
@@ -51,6 +51,12 @@ POOL="dedup_pool"
save_tunable TXG_TIMEOUT
# we set the dedup log txg interval to 1, to get a log flush every txg,
# effectively disabling the log. without this it's hard to predict when and
# where things appear on-disk
log_must save_tunable DEDUP_LOG_TXG_MAX
log_must set_tunable32 DEDUP_LOG_TXG_MAX 1
function cleanup
{
if poolexists $POOL ; then
@@ -58,6 +64,7 @@ function cleanup
fi
log_must rm -fd $VDEV_GENERAL $VDEV_DEDUP $MOUNTDIR
log_must restore_tunable TXG_TIMEOUT
log_must restore_tunable DEDUP_LOG_TXG_MAX
}
@@ -206,10 +213,15 @@ function ddt_dedup_vdev_limit
#
# With no DDT quota in place, the above workload will produce over
-# 800,000 entries by using space in the normal class. With a quota,
-# it will be well below 500,000 entries.
+# 800,000 entries by using space in the normal class. With a quota, it
+# should be well under 500,000. However, logged entries are hard to
+# account for because they can appear on both logs, and can also
+# represent an eventual removal. This isn't easily visible from
+# outside, and even internally can result in going slightly over quota.
+# For here, we just set the entry count a little higher than what we
+# expect to allow for some instability.
#
-log_must test $(ddt_entries) -le 500000
+log_must test $(ddt_entries) -le 600000
do_clean
}