ddt: dedup log

Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, it is added to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go to the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.
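
A minimal sketch of that lookup order, not the actual implementation: the
entry_t and store_t types and the store_find() and dedup_lookup() helpers
are hypothetical stand-ins, and a linear array stands in for both the log
tree and the ZAP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified entry; real DDT entries hold far more. */
typedef struct {
	uint64_t key;		/* stand-in for the block checksum */
	uint64_t refcnt;	/* stand-in for the reference count */
} entry_t;

/* Hypothetical store: used here for both the log tree and the ZAP. */
typedef struct {
	const entry_t *entries;
	size_t count;
} store_t;

/* Linear search; returns NULL when the key is not present. */
static const entry_t *
store_find(const store_t *s, uint64_t key)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->entries[i].key == key)
			return (&s->entries[i]);
	}
	return (NULL);
}

/* Check the in-memory log first; consult the ZAP only on a miss. */
static const entry_t *
dedup_lookup(const store_t *log_store, const store_t *zap_store,
    uint64_t key)
{
	const entry_t *e = store_find(log_store, key);

	if (e != NULL)
		return (e);
	return (store_find(zap_store, key));
}
```

An entry present in both stores is served from the log, so a recently
updated entry shadows the older copy in the ZAP.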

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There are actually two separate "logs" (in-memory tree and
on-disk object), one active (receiving updated entries) and one flushing
(writing out to disk). These are swapped (i.e. flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.
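
A minimal sketch of the swap decision, assuming the thresholds described
for zfs_dedup_log_mem_max, zfs_dedup_log_mem_max_percent and
zfs_dedup_log_txg_max in the tunables documentation. The log_limits_t
type, the resolve_mem_max() and should_start_flush() helpers, and the
exact comparisons are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits, mirroring the documented tunables. */
typedef struct {
	uint64_t mem_max;	/* cf. zfs_dedup_log_mem_max, in bytes */
	uint64_t txg_max;	/* cf. zfs_dedup_log_txg_max */
} log_limits_t;

/*
 * A memory limit of 0 means "derive it from a percentage of total
 * memory" (cf. zfs_dedup_log_mem_max_percent).
 */
static uint64_t
resolve_mem_max(uint64_t mem_max, uint64_t percent, uint64_t total_mem)
{
	if (mem_max != 0)
		return (mem_max);
	return (total_mem / 100 * percent);
}

/*
 * Assumed policy: start flushing once the log trees use about half of
 * the memory limit, or once too many txgs have passed without a flush.
 */
static bool
should_start_flush(const log_limits_t *lim, uint64_t log_mem_used,
    uint64_t txgs_since_flush)
{
	if (log_mem_used >= lim->mem_max / 2)
		return (true);
	if (txgs_since_flush >= lim->txg_max)
		return (true);
	return (false);
}
```

With a 100-byte limit and txg_max of 8, the sketch triggers at 50 bytes
of log memory, or after 8 txgs even when the log is nearly empty.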

The flushing facility monitors the number of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tunables are provided to control the flushing facility.
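
The tunables documentation describes the rates as exponentially weighted
moving averages over a number of recent txgs. A minimal sketch of that
idea follows; the ewma_t type, the ewma_update() and flush_target()
helpers, and the smoothing factor 2 / (N + 1) are assumptions, not the
formula the code necessarily uses.

```c
#include <stdint.h>

/* Hypothetical running average over roughly the last ntxgs samples. */
typedef struct {
	double avg;		/* current smoothed rate */
	unsigned ntxgs;		/* cf. zfs_dedup_log_flush_flow_rate_txgs */
	int primed;		/* 0 until the first sample arrives */
} ewma_t;

/* Fold one per-txg sample into the average. */
static void
ewma_update(ewma_t *e, double sample)
{
	double alpha;

	if (!e->primed) {
		e->avg = sample;
		e->primed = 1;
		return;
	}
	/* Assumed smoothing factor; a larger ntxgs reacts more slowly. */
	alpha = 2.0 / ((double)e->ntxgs + 1.0);
	e->avg += alpha * (sample - e->avg);
}

/*
 * Assumed target: flush at least the smoothed ingest rate, and never
 * fewer than the floor (cf. zfs_dedup_log_flush_entries_min).
 */
static uint64_t
flush_target(const ewma_t *ingest, uint64_t entries_min)
{
	uint64_t want = (uint64_t)ingest->avg;

	return (want > entries_min ? want : entries_min);
}
```

A higher ntxgs smooths out spiky workloads at the cost of adapting more
slowly to a sustained change, which matches the trade-off described for
zfs_dedup_log_flush_flow_rate_txgs.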

All the histograms and stats are updated to accommodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
Rob Norris
2023-06-22 17:46:22 +10:00
committed by Brian Behlendorf
parent cbb9ef0a4c
commit cd69ba3d49
17 changed files with 1621 additions and 131 deletions
@@ -974,6 +974,88 @@ milliseconds until the operation completes.
.It Sy zfs_dedup_prefetch Ns = Ns Sy 0 Ns | Ns 1 Pq int
Enable prefetching dedup-ed blocks which are going to be freed.
.
.It Sy zfs_dedup_log_flush_passes_max Ns = Ns Sy 8 Ns Pq uint
Maximum number of dedup log flush passes (iterations) each transaction.
.Pp
At the start of each transaction, OpenZFS will estimate how many entries it
needs to flush out to keep up with the change rate, taking the amount and time
taken to flush on previous txgs into account (see
.Sy zfs_dedup_log_flush_flow_rate_txgs ) .
It will spread this amount into a number of passes.
At each pass, it will use the amount already flushed and the total time taken
by flushing and by other IO to recompute how much it should do for the remainder
of the txg.
.Pp
Reducing the max number of passes will make flushing more aggressive, flushing
out more entries on each pass.
This can be faster, but also more likely to compete with other IO.
Increasing the max number of passes will put fewer entries onto each pass,
keeping the overhead of dedup changes to a minimum but possibly causing a large
number of changes to be dumped on the last pass, which can blow out the txg
sync time beyond
.Sy zfs_txg_timeout .
.
.It Sy zfs_dedup_log_flush_min_time_ms Ns = Ns Sy 1000 Ns Pq uint
Minimum time to spend on dedup log flush each transaction.
.Pp
At least this long will be spent flushing dedup log entries each transaction,
up to
.Sy zfs_txg_timeout .
This occurs even if doing so delays the transaction, that is, even if other IO
completes in less than this time.
.
.It Sy zfs_dedup_log_flush_entries_min Ns = Ns Sy 1000 Ns Pq uint
Flush at least this many entries each transaction.
.Pp
OpenZFS will estimate how many entries it needs to flush each transaction to
keep up with the ingest rate (see
.Sy zfs_dedup_log_flush_flow_rate_txgs ) .
This sets the minimum for that estimate.
Raising it can force OpenZFS to flush more aggressively, keeping the log small
and so reducing pool import times, but can make it less able to back off if
log flushing would compete with other IO too much.
.
.It Sy zfs_dedup_log_flush_flow_rate_txgs Ns = Ns Sy 10 Ns Pq uint
Number of transactions to use to compute the flow rate.
.Pp
OpenZFS will estimate how many entries it needs to flush each transaction by
monitoring the number of entries changed (ingest rate), number of entries
flushed (flush rate) and time spent flushing (flush time rate) and combining
these into an overall "flow rate".
It will use an exponentially weighted moving average over some number of recent
transactions to compute these rates.
This sets the number of transactions to compute these averages over.
Setting it higher can help to smooth out the flow rate in the face of spiky
workloads, but will take longer for the flow rate to adjust to a sustained
change in the ingest rate.
.
.It Sy zfs_dedup_log_txg_max Ns = Ns Sy 8 Ns Pq uint
Max transactions to wait before starting to flush dedup logs.
.Pp
OpenZFS maintains two dedup logs, one receiving new changes, one flushing.
If there is nothing to flush, it will accumulate changes for no more than this
many transactions before switching the logs and starting to flush entries out.
.
.It Sy zfs_dedup_log_mem_max Ns = Ns Sy 0 Ns Pq u64
Max memory to use for dedup logs.
.Pp
OpenZFS will spend no more than this much memory on maintaining the in-memory
dedup log.
Flushing will begin when around half this amount is being spent on logs.
The default value of
.Sy 0
will cause it to be set by
.Sy zfs_dedup_log_mem_max_percent
instead.
.
.It Sy zfs_dedup_log_mem_max_percent Ns = Ns Sy 1 Ns % Pq uint
Max memory to use for dedup logs, as a percentage of total memory.
.Pp
If
.Sy zfs_dedup_log_mem_max
is not set, it will be initialised as a percentage of the total memory in the
system.
.
.It Sy zfs_delay_min_dirty_percent Ns = Ns Sy 60 Ns % Pq uint
Start to delay each transaction once there is this amount of dirty data,
expressed as a percentage of