Improve zfs receive performance with lightweight write

The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes.  Much of the CPU is spent creating and
destroying dbufs and arc bufs, one for each `WRITE` record in the send
stream.

This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`.  The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (a read would see the
on-disk state rather than the dirty data).  Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
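
The read limitation can be illustrated with a toy model.  This is a
sketch only, not ZFS code: `toy_block_t`, `toy_read()` and `toy_sync()`
are hypothetical names, and the model assumes that reads consult a
cached buffer (standing in for a dbuf) and otherwise fall back to the
on-disk state.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not ZFS code: one block with its on-disk contents, an
 * optional cached buffer (standing in for a dbuf), and optional data
 * held only by a dirty record (standing in for a lightweight write).
 */
typedef struct toy_block {
	int tb_on_disk;
	const int *tb_dbuf;		/* NULL when no dbuf exists */
	const int *tb_lightweight;	/* NULL when nothing is pending */
} toy_block_t;

/* Reads consult only the dbuf, then fall back to the on-disk state. */
int
toy_read(const toy_block_t *tb)
{
	assert(tb != NULL);
	return (tb->tb_dbuf != NULL ? *tb->tb_dbuf : tb->tb_on_disk);
}

/* Syncing writes out whichever pending data exists. */
void
toy_sync(toy_block_t *tb)
{
	assert(tb != NULL);
	if (tb->tb_lightweight != NULL) {
		tb->tb_on_disk = *tb->tb_lightweight;
		tb->tb_lightweight = NULL;
	} else if (tb->tb_dbuf != NULL) {
		tb->tb_on_disk = *tb->tb_dbuf;
	}
}
```

In this model a block with only lightweight data pending reads back its
old on-disk value until `toy_sync()` runs, which is harmless for a
write-only consumer.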

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

In a test receiving a stream with an average compressed block size of 4KB,
this commit improves throughput by about 50%, while also reducing CPU usage
by about half of one CPU.  On a per-block basis, CPU consumed by
`receive_writer()` and `dbuf_evict()` is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
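
The per-block figure can be roughly cross-checked from the two
measurements above.  A minimal sketch, assuming the block size is the
same in both runs so that MB/s is proportional to blocks per second;
`cpu_per_mb()` and `per_block_cpu_ratio()` are hypothetical helpers,
not part of ZFS.

```c
#include <assert.h>

/*
 * CPU cost per unit of data: percent of one CPU divided by throughput
 * in MB/s.  With a fixed block size this is proportional to the CPU
 * cost per block.
 */
double
cpu_per_mb(double cpu_percent, double mb_per_sec)
{
	assert(mb_per_sec > 0.0);
	return (cpu_percent / mb_per_sec);
}

/* Ratio of the new per-block CPU cost to the baseline cost. */
double
per_block_cpu_ratio(void)
{
	/* receive_writer() + dbuf_evict(), figures from the lines above */
	double baseline = cpu_per_mb(40.0 + 35.0, 450.0);
	double updated = cpu_per_mb(17.0 + 0.0, 670.0);

	return (updated / baseline);
}
```

With these rounded inputs the ratio comes out at about 0.15, which is in
line with the "1/7th" quoted above.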

The code is also restructured in a few ways:

Added a `dr_dnode` field to the dbuf_dirty_record_t.  This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines.  The new field is needed by the lightweight-type dirty record.

To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t.  To do this we keep a hold on the dnode until its zios have
completed.  This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
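
The hold-until-completion idea can be sketched with a toy reference
count.  This is an illustration under stated assumptions, not the code
in this commit: all `toy_*` names are hypothetical, and the model
assumes a dnode is only eligible to be relocated when it has no holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for dnode_t: only a hold count. */
typedef struct toy_dnode {
	int td_holds;
} toy_dnode_t;

/* Toy stand-in for a dirty record that points back at its dnode. */
typedef struct toy_dirty_record {
	toy_dnode_t *dr_dnode;
} toy_dirty_record_t;

/* Assumption of this sketch: a held dnode is never relocated. */
bool
toy_dnode_can_move(const toy_dnode_t *dn)
{
	return (dn->td_holds == 0);
}

/* When syncing starts, take a hold so dr_dnode stays valid. */
void
toy_sync_begin(toy_dirty_record_t *dr, toy_dnode_t *dn)
{
	dn->td_holds++;
	dr->dr_dnode = dn;
}

/* Only when the I/O has completed is the hold dropped. */
void
toy_zio_done(toy_dirty_record_t *dr)
{
	assert(dr->dr_dnode != NULL);
	assert(dr->dr_dnode->td_holds > 0);
	dr->dr_dnode->td_holds--;
	dr->dr_dnode = NULL;
}
```

Between `toy_sync_begin()` and `toy_zio_done()` the dnode cannot be
moved in this model, so the pointer stored in the dirty record stays
valid for the record's whole lifetime.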

`dn_dirty_txg` was previously zeroed when the dnode was synced.  This
was not necessary, since its meaning can be "when was this dnode last
dirtied".  This change simplifies the new `dnode_rele_task()` code.
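
One way to see why zeroing is unnecessary, as a sketch only: if
dirtiness is decided by comparing `dn_dirty_txg` against the txg that
is currently syncing, a stale value from an older txg already reads as
"clean".  The comparison and the `toy_*` names below are assumptions
made for illustration and are not taken from this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct toy_txg_dnode {
	uint64_t dn_dirty_txg;	/* txg in which this dnode was last dirtied */
} toy_txg_dnode_t;

/* Record the txg of the most recent dirtying; never reset afterwards. */
void
toy_setdirty(toy_txg_dnode_t *dn, uint64_t txg)
{
	assert(txg >= dn->dn_dirty_txg);	/* txgs only move forward here */
	dn->dn_dirty_txg = txg;
}

/* Dirty means: dirtied in the syncing txg or a later one. */
bool
toy_is_dirty(const toy_txg_dnode_t *dn, uint64_t syncing_txg)
{
	return (dn->dn_dirty_txg >= syncing_txg);
}
```

A dnode dirtied in txg 5 reports dirty while txg 5 is syncing and clean
once txg 6 is syncing, without the field ever being cleared.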

Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
Matthew Ahrens
2020-12-11 10:26:02 -08:00
committed by GitHub
parent 7d4b365ce3
commit ba67d82142
10 changed files with 540 additions and 267 deletions
+8 -5
@@ -20,7 +20,7 @@
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2012, 2019 by Delphix. All rights reserved.
+ * Copyright (c) 2012, 2020 by Delphix. All rights reserved.
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
*/
@@ -609,7 +609,6 @@ dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
ASSERT0(dn->dn_maxblkid);
ASSERT0(dn->dn_allocated_txg);
ASSERT0(dn->dn_assigned_txg);
- ASSERT0(dn->dn_dirty_txg);
ASSERT(zfs_refcount_is_zero(&dn->dn_tx_holds));
ASSERT3U(zfs_refcount_count(&dn->dn_holds), <=, 1);
ASSERT(avl_is_empty(&dn->dn_dbufs));
@@ -649,6 +648,7 @@ dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
dn->dn_free_txg = 0;
dn->dn_dirtyctx_firstset = NULL;
+ dn->dn_dirty_txg = 0;
dn->dn_allocated_txg = tx->tx_txg;
dn->dn_id_flags = 0;
@@ -1812,6 +1812,7 @@ dnode_set_nlevels_impl(dnode_t *dn, int new_nlevels, dmu_tx_t *tx)
ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
ASSERT3U(new_nlevels, >, dn->dn_nlevels);
dn->dn_nlevels = new_nlevels;
ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]);
@@ -1829,10 +1830,12 @@ dnode_set_nlevels_impl(dnode_t *dn, int new_nlevels, dmu_tx_t *tx)
list = &dn->dn_dirty_records[txgoff];
for (dr = list_head(list); dr; dr = dr_next) {
dr_next = list_next(&dn->dn_dirty_records[txgoff], dr);
- if (dr->dr_dbuf->db_level != new_nlevels-1 &&
+ IMPLY(dr->dr_dbuf == NULL, old_nlevels == 1);
+ if (dr->dr_dbuf == NULL ||
+ (dr->dr_dbuf->db_level == old_nlevels - 1 &&
  dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID &&
- dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID) {
- ASSERT(dr->dr_dbuf->db_level == old_nlevels-1);
+ dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID)) {
list_remove(&dn->dn_dirty_records[txgoff], dr);
list_insert_tail(&new->dt.di.dr_children, dr);
dr->dr_parent = new;