/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2016 by Delphix. All rights reserved.
 * Copyright (c) 2023, Klara Inc.
 */

#ifndef _SYS_DDT_IMPL_H
#define	_SYS_DDT_IMPL_H

#include <sys/ddt.h>
#include <sys/bitops.h>

#ifdef __cplusplus
extern "C" {
#endif

/* DDT version numbers */
#define	DDT_VERSION_LEGACY	(0)
#define	DDT_VERSION_FDT		(1)

/* Names of interesting objects in the DDT root dir */
#define	DDT_DIR_VERSION		"version"
#define	DDT_DIR_FLAGS		"flags"

/* Fill a lightweight entry from a live entry. */
#define	DDT_ENTRY_TO_LIGHTWEIGHT(ddt, dde, ddlwe) do {			\
	memset((ddlwe), 0, sizeof (*ddlwe));				\
	(ddlwe)->ddlwe_key = (dde)->dde_key;				\
	(ddlwe)->ddlwe_type = (dde)->dde_type;				\
	(ddlwe)->ddlwe_class = (dde)->dde_class;			\
	memcpy(&(ddlwe)->ddlwe_phys, (dde)->dde_phys, DDT_PHYS_SIZE(ddt)); \
} while (0)

#define	DDT_LOG_ENTRY_TO_LIGHTWEIGHT(ddt, ddle, ddlwe) do {		\
	memset((ddlwe), 0, sizeof (*ddlwe));				\
	(ddlwe)->ddlwe_key = (ddle)->ddle_key;				\
	(ddlwe)->ddlwe_type = (ddle)->ddle_type;			\
	(ddlwe)->ddlwe_class = (ddle)->ddle_class;			\
	memcpy(&(ddlwe)->ddlwe_phys, (ddle)->ddle_phys, DDT_PHYS_SIZE(ddt)); \
} while (0)

/*
 * An entry on the log tree. These are "frozen", and a record of what's in
 * the on-disk log. They can't be used in place, but can be "loaded" back into
 * the live tree.
 */
typedef struct {
	ddt_key_t	ddle_key;	/* ddt_log_tree key */
	avl_node_t	ddle_node;	/* ddt_log_tree node */

	ddt_type_t	ddle_type;	/* storage type */
	ddt_class_t	ddle_class;	/* storage class */

	/* extra allocation for flat/trad phys */
	ddt_univ_phys_t	ddle_phys[];
} ddt_log_entry_t;

/* On-disk log record types. */
typedef enum {
	DLR_INVALID = 0,	/* end of block marker */
	DLR_ENTRY = 1,		/* an entry to add or replace in the log tree */
} ddt_log_record_type_t;

/* On-disk log record header. */
typedef struct {
	/*
	 * dlr_info is a packed u64, use the DLR_GET/DLR_SET macros below to
	 * access it.
	 *
	 * bits 0-7: record type (ddt_log_record_type_t)
	 * bits 8-23: length of record header+payload
	 * bits 24-47: reserved, all zero
	 * bits 48-55: if type==DLR_ENTRY, storage type (ddt_type)
	 *             otherwise all zero
	 * bits 56-63: if type==DLR_ENTRY, storage class (ddt_class)
	 *             otherwise all zero
	 */
	uint64_t	dlr_info;
	uint8_t		dlr_payload[];
} ddt_log_record_t;

#define	DLR_GET_TYPE(dlr)		BF64_GET((dlr)->dlr_info, 0, 8)
#define	DLR_SET_TYPE(dlr, v)		BF64_SET((dlr)->dlr_info, 0, 8, v)
#define	DLR_GET_RECLEN(dlr)		BF64_GET((dlr)->dlr_info, 8, 16)
#define	DLR_SET_RECLEN(dlr, v)		BF64_SET((dlr)->dlr_info, 8, 16, v)
#define	DLR_GET_ENTRY_TYPE(dlr)		BF64_GET((dlr)->dlr_info, 48, 8)
#define	DLR_SET_ENTRY_TYPE(dlr, v)	BF64_SET((dlr)->dlr_info, 48, 8, v)
#define	DLR_GET_ENTRY_CLASS(dlr)	BF64_GET((dlr)->dlr_info, 56, 8)
#define	DLR_SET_ENTRY_CLASS(dlr, v)	BF64_SET((dlr)->dlr_info, 56, 8, v)

/* Payload for DLR_ENTRY. */
typedef struct {
	ddt_key_t	dlre_key;
	ddt_univ_phys_t	dlre_phys[];
} ddt_log_record_entry_t;

/* Log flags (ddl_flags, dlh_flags) */
#define	DDL_FLAG_FLUSHING	(1 << 0)	/* this log is being flushed */
#define	DDL_FLAG_CHECKPOINT	(1 << 1)	/* header has a checkpoint */

/* On-disk log header, stored in the bonus buffer. */
typedef struct {
	/*
	 * dlh_info is a packed u64, use the DLH_GET/DLH_SET macros below to
	 * access it.
	 *
	 * bits 0-7: log version
	 * bits 8-15: log flags
	 * bits 16-63: reserved, all zero
	 */
	uint64_t	dlh_info;

	uint64_t	dlh_length;	/* log size in bytes */
	uint64_t	dlh_first_txg;	/* txg this log went active */
	ddt_key_t	dlh_checkpoint;	/* last checkpoint */
} ddt_log_header_t;

#define	DLH_GET_VERSION(dlh)	BF64_GET((dlh)->dlh_info, 0, 8)
#define	DLH_SET_VERSION(dlh, v)	BF64_SET((dlh)->dlh_info, 0, 8, v)
#define	DLH_GET_FLAGS(dlh)	BF64_GET((dlh)->dlh_info, 8, 8)
#define	DLH_SET_FLAGS(dlh, v)	BF64_SET((dlh)->dlh_info, 8, 8, v)

/* DDT log update state */
typedef struct {
	dmu_tx_t	*dlu_tx;	/* tx the update is being applied to */
	dnode_t		*dlu_dn;	/* log object dnode */
	dmu_buf_t	**dlu_dbp;	/* array of block buffer pointers */
	int		dlu_ndbp;	/* number of block buffer pointers */
	uint16_t	dlu_reclen;	/* cached length of record */
	uint64_t	dlu_block;	/* block for next entry */
	uint64_t	dlu_offset;	/* offset for next entry */
} ddt_log_update_t;

/*
 * Ops vector to access a specific DDT object type.
 */
typedef struct {
	char ddt_op_name[32];
	int (*ddt_op_create)(objset_t *os, uint64_t *object, dmu_tx_t *tx,
	    boolean_t prehash);
	int (*ddt_op_destroy)(objset_t *os, uint64_t object, dmu_tx_t *tx);
	int (*ddt_op_lookup)(objset_t *os, uint64_t object,
	    const ddt_key_t *ddk, void *phys, size_t psize);
	int (*ddt_op_contains)(objset_t *os, uint64_t object,
	    const ddt_key_t *ddk);
	void (*ddt_op_prefetch)(objset_t *os, uint64_t object,
	    const ddt_key_t *ddk);
	void (*ddt_op_prefetch_all)(objset_t *os, uint64_t object);
	int (*ddt_op_update)(objset_t *os, uint64_t object,
	    const ddt_key_t *ddk, const void *phys, size_t psize,
	    dmu_tx_t *tx);
	int (*ddt_op_remove)(objset_t *os, uint64_t object,
	    const ddt_key_t *ddk, dmu_tx_t *tx);
	int (*ddt_op_walk)(objset_t *os, uint64_t object, uint64_t *walk,
	    ddt_key_t *ddk, void *phys, size_t psize);
	int (*ddt_op_count)(objset_t *os, uint64_t object, uint64_t *count);
} ddt_ops_t;

extern const ddt_ops_t ddt_zap_ops;

/* Dedup log API */
extern void ddt_log_begin(ddt_t *ddt, size_t nentries, dmu_tx_t *tx,
    ddt_log_update_t *dlu);
extern void ddt_log_entry(ddt_t *ddt, ddt_lightweight_entry_t *dde,
    ddt_log_update_t *dlu);
extern void ddt_log_commit(ddt_t *ddt, ddt_log_update_t *dlu);

extern boolean_t ddt_log_take_first(ddt_t *ddt, ddt_log_t *ddl,
    ddt_lightweight_entry_t *ddlwe);
extern boolean_t ddt_log_take_key(ddt_t *ddt, ddt_log_t *ddl,
    const ddt_key_t *ddk, ddt_lightweight_entry_t *ddlwe);

extern void ddt_log_checkpoint(ddt_t *ddt, ddt_lightweight_entry_t *ddlwe,
    dmu_tx_t *tx);
extern void ddt_log_truncate(ddt_t *ddt, dmu_tx_t *tx);

extern boolean_t ddt_log_swap(ddt_t *ddt, dmu_tx_t *tx);

extern void ddt_log_destroy(ddt_t *ddt, dmu_tx_t *tx);

extern int ddt_log_load(ddt_t *ddt);
extern void ddt_log_alloc(ddt_t *ddt);
extern void ddt_log_free(ddt_t *ddt);

extern void ddt_log_init(void);
extern void ddt_log_fini(void);

/*
 * These are only exposed so that zdb can access them. Try not to use them
 * outside of the DDT implementation proper, and if you do, consider moving
 * them up.
 */

/*
 * Enough room to expand DMU_POOL_DDT format for all possible DDT
 * checksum/class/type combinations.
 */
#define	DDT_NAMELEN	32

extern uint64_t ddt_phys_total_refcnt(const ddt_t *ddt,
    const ddt_univ_phys_t *ddp);

extern void ddt_key_fill(ddt_key_t *ddk, const blkptr_t *bp);

extern void ddt_object_name(ddt_t *ddt, ddt_type_t type, ddt_class_t clazz,
    char *name);
extern int ddt_object_walk(ddt_t *ddt, ddt_type_t type, ddt_class_t clazz,
    uint64_t *walk, ddt_lightweight_entry_t *ddlwe);
extern int ddt_object_count(ddt_t *ddt, ddt_type_t type, ddt_class_t clazz,
    uint64_t *count);
extern int ddt_object_info(ddt_t *ddt, ddt_type_t type, ddt_class_t clazz,
    dmu_object_info_t *);

#ifdef __cplusplus
}
#endif

#endif /* _SYS_DDT_IMPL_H */