/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or https://opensource.org/licenses/CDDL-1.0.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
Illumos #764: panic in zfs:dbuf_sync_list
Hypothesis about what's going on here.
At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);
These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)
Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.
Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().
References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-07-26 22:37:06 +04:00
 * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbufs and arc bufs, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
 * Copyright (c) 2012, 2020 by Delphix. All rights reserved.
 * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
 * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm. Available compression
levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
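The bit layout described in the paragraph above can be sketched as follows. This is a minimal illustration, not the OpenZFS implementation: the macro and function names (`SKETCH_COMPRESS_BITS`, `sketch_pack_compress()`, and so on) are hypothetical, and only the 7-bit split between algorithm and level is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the 7-bit algorithm field comes from the text. */
#define	SKETCH_COMPRESS_BITS	7
#define	SKETCH_COMPRESS_MASK	((UINT64_C(1) << SKETCH_COMPRESS_BITS) - 1)

/* Combine an algorithm ID (lower 7 bits) with a level (upper bits). */
static uint64_t
sketch_pack_compress(uint64_t algo, uint64_t level)
{
	assert(algo <= SKETCH_COMPRESS_MASK);
	assert(level <= (UINT64_MAX >> SKETCH_COMPRESS_BITS));
	return ((level << SKETCH_COMPRESS_BITS) | algo);
}

/* Recover the algorithm ID from a combined property value. */
static uint64_t
sketch_compress_algo(uint64_t prop)
{
	return (prop & SKETCH_COMPRESS_MASK);
}

/* Recover the compression level from a combined property value. */
static uint64_t
sketch_compress_level(uint64_t prop)
{
	return (prop >> SKETCH_COMPRESS_BITS);
}
```

Splitting the value back into its two halves is what lets internal code carry separate algorithm and level variables while only the stored property holds the combined form.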
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc.), the
same parameters will be used, resulting in a matching checksum.
All of the internal ZFS code (`arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
 * Copyright (c) 2019, Klara Inc.
 * Copyright (c) 2019, Allan Jude
 * Copyright (c) 2021, 2022 by Pawel Jakub Dawidek
 */

#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/dmu.h>
#include <sys/dmu_send.h>
#include <sys/dmu_impl.h>
#include <sys/dbuf.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dmu_tx.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/dmu_zfetch.h>
#include <sys/sa.h>
#include <sys/sa_impl.h>
#include <sys/zfeature.h>
#include <sys/blkptr.h>
#include <sys/range_tree.h>
Enable use of DTRACE_PROBE* macros in "spl" module
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use them in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9525
2019-10-30 21:02:41 +03:00
#include <sys/trace_zfs.h>
#include <sys/callb.h>
#include <sys/abd.h>
#include <sys/brt.h>
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
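The remapping step described above can be sketched as a lookup in a sorted, in-memory table. This is only an illustration under stated assumptions: the struct, its field names, and `sketch_remap()` are hypothetical and do not reflect the actual OpenZFS indirect mapping format; entries are assumed to be sorted by source offset and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied range of the removed vdev. */
typedef struct sketch_mapping_entry {
	uint64_t sme_src_offset;	/* start of range on the removed vdev */
	uint64_t sme_size;		/* length of the range in bytes */
	uint64_t sme_dst_vdev;		/* vdev that now holds the data */
	uint64_t sme_dst_offset;	/* start of the range on that vdev */
} sketch_mapping_entry_t;

/*
 * Binary-search a table sorted by sme_src_offset for the entry covering
 * 'offset'. Returns 0 and fills in the new location on success, or -1 if
 * no entry covers the offset.
 */
static int
sketch_remap(const sketch_mapping_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const sketch_mapping_entry_t *e = &tbl[mid];

		if (offset < e->sme_src_offset) {
			hi = mid;
		} else if (offset - e->sme_src_offset >= e->sme_size) {
			lo = mid + 1;
		} else {
			*dst_vdev = e->sme_dst_vdev;
			*dst_offset = e->sme_dst_offset +
			    (offset - e->sme_src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Because the table is held in memory and searched in logarithmic time, a read aimed at the removed vdev costs only a lookup before it is reissued at the new location.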
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
#include <sys/vdev.h>
#include <cityhash.h>
#include <sys/spa_impl.h>
#include <sys/wmsum.h>
#include <sys/vdev_impl.h>

static kstat_t *dbuf_ksp;

typedef struct dbuf_stats {
	/*
	 * Various statistics about the size of the dbuf cache.
	 */
	kstat_named_t cache_count;
	kstat_named_t cache_size_bytes;
	kstat_named_t cache_size_bytes_max;
	/*
	 * Statistics regarding the bounds on the dbuf cache size.
	 */
	kstat_named_t cache_target_bytes;
	kstat_named_t cache_lowater_bytes;
	kstat_named_t cache_hiwater_bytes;
	/*
	 * Total number of dbuf cache evictions that have occurred.
	 */
	kstat_named_t cache_total_evicts;
	/*
	 * The distribution of dbuf levels in the dbuf cache and
	 * the total size of all dbufs at each level.
	 */
	kstat_named_t cache_levels[DN_MAX_LEVELS];
	kstat_named_t cache_levels_bytes[DN_MAX_LEVELS];
	/*
	 * Statistics about the dbuf hash table.
	 */
	kstat_named_t hash_hits;
	kstat_named_t hash_misses;
	kstat_named_t hash_collisions;
	kstat_named_t hash_elements;
	kstat_named_t hash_elements_max;
	/*
	 * Number of sublists containing more than one dbuf in the dbuf
	 * hash table. Keep track of the longest hash chain.
	 */
	kstat_named_t hash_chains;
	kstat_named_t hash_chain_max;
	/*
	 * Number of times a dbuf_create() discovers that a dbuf was
	 * already created and in the dbuf hash table.
	 */
	kstat_named_t hash_insert_race;
	/*
	 * Number of entries in the hash table dbuf and mutex arrays.
	 */
	kstat_named_t hash_table_count;
	kstat_named_t hash_mutex_count;
	/*
	 * Statistics about the size of the metadata dbuf cache.
	 */
	kstat_named_t metadata_cache_count;
	kstat_named_t metadata_cache_size_bytes;
	kstat_named_t metadata_cache_size_bytes_max;
	/*
	 * For diagnostic purposes, this is incremented whenever we can't add
	 * something to the metadata cache because it's full, and instead put
	 * the data in the regular dbuf cache.
	 */
	kstat_named_t metadata_cache_overflow;
} dbuf_stats_t;

dbuf_stats_t dbuf_stats = {
	{ "cache_count",			KSTAT_DATA_UINT64 },
	{ "cache_size_bytes",			KSTAT_DATA_UINT64 },
	{ "cache_size_bytes_max",		KSTAT_DATA_UINT64 },
	{ "cache_target_bytes",			KSTAT_DATA_UINT64 },
	{ "cache_lowater_bytes",		KSTAT_DATA_UINT64 },
	{ "cache_hiwater_bytes",		KSTAT_DATA_UINT64 },
	{ "cache_total_evicts",			KSTAT_DATA_UINT64 },
	{ { "cache_levels_N",			KSTAT_DATA_UINT64 } },
	{ { "cache_levels_bytes_N",		KSTAT_DATA_UINT64 } },
	{ "hash_hits",				KSTAT_DATA_UINT64 },
	{ "hash_misses",			KSTAT_DATA_UINT64 },
	{ "hash_collisions",			KSTAT_DATA_UINT64 },
	{ "hash_elements",			KSTAT_DATA_UINT64 },
	{ "hash_elements_max",			KSTAT_DATA_UINT64 },
	{ "hash_chains",			KSTAT_DATA_UINT64 },
	{ "hash_chain_max",			KSTAT_DATA_UINT64 },
	{ "hash_insert_race",			KSTAT_DATA_UINT64 },
	{ "hash_table_count",			KSTAT_DATA_UINT64 },
	{ "hash_mutex_count",			KSTAT_DATA_UINT64 },
	{ "metadata_cache_count",		KSTAT_DATA_UINT64 },
	{ "metadata_cache_size_bytes",		KSTAT_DATA_UINT64 },
	{ "metadata_cache_size_bytes_max",	KSTAT_DATA_UINT64 },
	{ "metadata_cache_overflow",		KSTAT_DATA_UINT64 }
};

struct {
	wmsum_t cache_count;
	wmsum_t cache_total_evicts;
	wmsum_t cache_levels[DN_MAX_LEVELS];
	wmsum_t cache_levels_bytes[DN_MAX_LEVELS];
	wmsum_t hash_hits;
	wmsum_t hash_misses;
	wmsum_t hash_collisions;
	wmsum_t hash_chains;
	wmsum_t hash_insert_race;
	wmsum_t metadata_cache_count;
	wmsum_t metadata_cache_overflow;
} dbuf_sums;

#define	DBUF_STAT_INCR(stat, val)	\
	wmsum_add(&dbuf_sums.stat, val);
#define	DBUF_STAT_DECR(stat, val)	\
	DBUF_STAT_INCR(stat, -(val));
#define	DBUF_STAT_BUMP(stat)		\
	DBUF_STAT_INCR(stat, 1);
#define	DBUF_STAT_BUMPDOWN(stat)	\
	DBUF_STAT_INCR(stat, -1);
#define	DBUF_STAT_MAX(stat, v) {					\
	uint64_t _m;							\
	while ((v) > (_m = dbuf_stats.stat.value.ui64) &&		\
	    (_m != atomic_cas_64(&dbuf_stats.stat.value.ui64, _m, (v))))\
		continue;						\
}

static void dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx);
static void dbuf_sync_leaf_verify_bonus_dnode(dbuf_dirty_record_t *dr);
static int dbuf_read_verify_dnode_crypt(dmu_buf_impl_t *db, uint32_t flags);

/*
 * Global data structures and functions for the dbuf cache.
 */
static kmem_cache_t *dbuf_kmem_cache;
static taskq_t *dbu_evict_taskq;

static kthread_t *dbuf_cache_evict_thread;
static kmutex_t dbuf_evict_lock;
static kcondvar_t dbuf_evict_cv;
static boolean_t dbuf_evict_thread_exit;

/*
 * There are two dbuf caches; each dbuf can only be in one of them at a time.
 *
 * 1. Cache of metadata dbufs, to help make read-heavy administrative commands
 *    from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
 *    that represent the metadata that describes filesystems/snapshots/
 *    bookmarks/properties/etc. We only evict from this cache when we export a
 *    pool, to short-circuit as much I/O as possible for all administrative
 *    commands that need the metadata. There is no eviction policy for this
 *    cache, because we try to only include types in it which would occupy a
 *    very small amount of space per object but create a large impact on the
 *    performance of these commands. Instead, after it reaches a maximum size
 *    (which should only happen on very small memory systems with a very large
 *    number of filesystem objects), we stop taking new dbufs into the
 *    metadata cache, instead putting them in the normal dbuf cache.
 *
 * 2. LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
 *    are not currently held but have been recently released. These dbufs
 *    are not eligible for arc eviction until they are aged out of the cache.
 *    Dbufs that are aged out of the cache will be immediately destroyed and
 *    become eligible for arc eviction.
 *
 * Dbufs are added to these caches once the last hold is released. If a dbuf is
 * later accessed and still exists in the dbuf cache, then it will be removed
 * from the cache and later re-added to the head of the cache.
 *
 * If a given dbuf meets the requirements for the metadata cache, it will go
 * there, otherwise it will be considered for the generic LRU dbuf cache. The
 * caches and the refcounts tracking their sizes are stored in an array indexed
 * by those caches' matching enum values (from dbuf_cached_state_t).
 */
typedef struct dbuf_cache {
	multilist_t cache;
	zfs_refcount_t size ____cacheline_aligned;
} dbuf_cache_t;
dbuf_cache_t dbuf_caches[DB_CACHE_MAX];

/* Size limits for the caches */
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesirable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
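The big-endian failure mode described above can be modelled in portable C by treating the 64-bit variable as eight bytes in big-endian order. This is a simulation under stated assumptions, not kernel code: all function names are hypothetical, and the 32-bit store stands in for a module-parameter write that believes the variable is a 32-bit `ulong`.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 64-bit value into 8 bytes, most significant byte first. */
static void
sketch_store_be64(unsigned char mem[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		mem[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Load a 64-bit value from 8 bytes, most significant byte first. */
static uint64_t
sketch_load_be64(const unsigned char mem[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | mem[i];
	return (v);
}

/*
 * Model a 32-bit big-endian write to the address of a 64-bit variable.
 * Only the first four bytes are overwritten, and on a big-endian layout
 * those are the high-order half of the 64-bit value.
 */
static uint64_t
sketch_be32_store_into_u64(uint64_t initial, uint32_t value)
{
	unsigned char mem[8];

	sketch_store_be64(mem, initial);
	assert(sketch_load_be64(mem) == initial);
	for (int i = 0; i < 4; i++)
		mem[i] = (unsigned char)(value >> (24 - 8 * i));
	return (sketch_load_be64(mem));
}
```

In this model, writing 12 over a variable that held 9 leaves the low half untouched and places 12 in the high half, so the variable reads back as neither the old nor the intended value.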
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
static uint64_t dbuf_cache_max_bytes = UINT64_MAX;
static uint64_t dbuf_metadata_cache_max_bytes = UINT64_MAX;

/* Set the default sizes of the caches to log2 fraction of arc size */
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters were found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t dbuf_cache_shift = 5;
|
|
|
|
static uint_t dbuf_metadata_cache_shift = 6;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2022-09-19 22:17:11 +03:00
|
|
|
/* Set the dbuf hash mutex count as log2 shift (dynamic by default) */
|
|
|
|
static uint_t dbuf_mutex_cache_shift = 0;
|
2022-09-19 22:17:11 +03:00
|
|
|
|
2020-07-25 06:38:48 +03:00
|
|
|
static unsigned long dbuf_cache_target_bytes(void);
|
|
|
|
static unsigned long dbuf_metadata_cache_target_bytes(void);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
 * The LRU dbuf cache uses a three-stage eviction policy:
 *	- A low water mark designates when the dbuf eviction thread
 *	should stop evicting from the dbuf cache.
 *	- When we reach the maximum size (aka mid water mark), we
 *	signal the eviction thread to run.
 *	- The high water mark indicates when the eviction thread
 *	is unable to keep up with the incoming load and eviction must
 *	happen in the context of the calling thread.
 *
 * The dbuf cache:
 *                                                (max size)
 *                                      low water  mid water  hi water
 * +----------------------------------------+----------+----------+
 * |                                        |          |          |
 * |                                        |          |          |
 * |                                        |          |          |
 * |                                        |          |          |
 * +----------------------------------------+----------+----------+
 *                                        stop      signal      evict
 *                                      evicting   eviction   directly
 *                                                  thread
 *
 * The high and low water marks indicate the operating range for the eviction
 * thread. The low water mark is, by default, 90% of the total size of the
 * cache and the high water mark is at 110% (both of these percentages can be
 * changed by setting dbuf_cache_lowater_pct and dbuf_cache_hiwater_pct,
 * respectively). The eviction thread will try to ensure that the cache remains
 * within this range by waking up every second and checking if the cache is
 * above the low water mark. The thread can also be woken up by callers adding
 * elements into the cache if the cache is larger than the mid water (i.e. max
 * cache size). Once the eviction thread is woken up and eviction is required,
 * it will continue evicting buffers until it's able to reduce the cache size
 * to the low water mark. If the cache size continues to grow and hits the high
 * water mark, then callers adding elements to the cache will begin to evict
 * directly from the cache until the cache is no longer above the high water
 * mark.
 */
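The three-stage policy described in the comment above can be read as two small decisions: what a caller does after adding to the cache, and whether the eviction thread keeps going. The following is a minimal sketch of that reading, not the code in this file. All `sketch_` names are hypothetical, the marks are passed in as plain byte counts, and a real implementation may also signal the thread when it evicts directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the actions mirror the three stages in the comment. */
typedef enum {
	SKETCH_DO_NOTHING,	/* at or below the mid water mark */
	SKETCH_SIGNAL_THREAD,	/* above mid water: wake the eviction thread */
	SKETCH_EVICT_DIRECTLY	/* above high water: evict in the caller */
} sketch_action_t;

/* What a caller adding to the cache should do, given the current size. */
static sketch_action_t
sketch_caller_action(uint64_t size, uint64_t midwater, uint64_t hiwater)
{
	assert(midwater <= hiwater);
	if (size > hiwater)
		return (SKETCH_EVICT_DIRECTLY);
	if (size > midwater)
		return (SKETCH_SIGNAL_THREAD);
	return (SKETCH_DO_NOTHING);
}

/* The eviction thread keeps evicting only while above the low water mark. */
static int
sketch_thread_keeps_evicting(uint64_t size, uint64_t lowater)
{
	return (size > lowater);
}
```

With a mid water mark of 1000 bytes and the default 10% margins, the low and high marks are 900 and 1100, so a size of 1050 only signals the thread while 1101 triggers direct eviction.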
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The percentage above and below the maximum cache size.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static uint_t dbuf_cache_hiwater_pct = 10;
|
|
|
|
static uint_t dbuf_cache_lowater_pct = 10;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
|
|
|
dbuf_cons(void *vdb, void *unused, int kmflag)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused, (void) kmflag;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *db = vdb;
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db, 0, sizeof (dmu_buf_impl_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_init(&db->db_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_init(&db->db_rwlock, NULL, RW_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
cv_init(&db->db_changed, NULL, CV_DEFAULT, NULL);
|
2016-06-02 07:04:53 +03:00
|
|
|
multilist_link_init(&db->db_cache_link);
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&db->db_holds);
|
2015-04-03 06:14:28 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_dest(void *vdb, void *unused)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *db = vdb;
|
|
|
|
mutex_destroy(&db->db_mtx);
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_destroy(&db->db_rwlock);
|
2008-11-20 23:01:55 +03:00
|
|
|
cv_destroy(&db->db_changed);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!multilist_link_active(&db->db_cache_link));
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_destroy(&db->db_holds);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* dbuf hash table routines
|
|
|
|
*/
|
|
|
|
static dbuf_hash_table_t dbuf_hash_table;
|
|
|
|
|
2017-05-25 21:32:40 +03:00
|
|
|
/*
|
|
|
|
* We use Cityhash for this. It's fast, and has good hash properties without
|
|
|
|
* requiring any large static buffers.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static uint64_t
|
|
|
|
dbuf_hash(void *os, uint64_t obj, uint8_t lvl, uint64_t blkid)
|
|
|
|
{
|
2017-05-25 21:32:40 +03:00
|
|
|
return (cityhash4((uintptr_t)os, obj, (uint64_t)lvl, blkid));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-02-18 22:21:37 +03:00
|
|
|
#define DTRACE_SET_STATE(db, why) \
|
|
|
|
DTRACE_PROBE2(dbuf__state_change, dmu_buf_impl_t *, db, \
|
|
|
|
const char *, why)
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define DBUF_EQUAL(dbuf, os, obj, level, blkid) \
|
|
|
|
((dbuf)->db.db_object == (obj) && \
|
|
|
|
(dbuf)->db_objset == (os) && \
|
|
|
|
(dbuf)->db_level == (level) && \
|
|
|
|
(dbuf)->db_blkid == (blkid))
|
|
|
|
|
|
|
|
dmu_buf_impl_t *
|
2022-12-14 04:29:21 +03:00
|
|
|
dbuf_find(objset_t *os, uint64_t obj, uint8_t level, uint64_t blkid,
|
|
|
|
uint64_t *hash_out)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
2010-08-26 20:52:39 +04:00
|
|
|
uint64_t hv;
|
|
|
|
uint64_t idx;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *db;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hv = dbuf_hash(os, obj, level, blkid);
|
2010-08-26 20:52:39 +04:00
|
|
|
idx = hv & h->hash_table_mask;
|
|
|
|
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_enter(DBUF_HASH_MUTEX(h, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
for (db = h->hash_table[idx]; db != NULL; db = db->db_hash_next) {
|
|
|
|
if (DBUF_EQUAL(db, os, obj, level, blkid)) {
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
if (db->db_state != DB_EVICTING) {
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_exit(DBUF_HASH_MUTEX(h, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
return (db);
|
|
|
|
}
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
}
|
|
|
|
}
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_exit(DBUF_HASH_MUTEX(h, idx));
|
2022-12-14 04:29:21 +03:00
|
|
|
if (hash_out != NULL)
|
|
|
|
*hash_out = hv;
|
2008-11-20 23:01:55 +03:00
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
2015-04-02 14:59:15 +03:00
|
|
|
static dmu_buf_impl_t *
|
|
|
|
dbuf_find_bonus(objset_t *os, uint64_t object)
|
|
|
|
{
|
|
|
|
dnode_t *dn;
|
|
|
|
dmu_buf_impl_t *db = NULL;
|
|
|
|
|
|
|
|
if (dnode_hold(os, object, FTAG, &dn) == 0) {
|
|
|
|
rw_enter(&dn->dn_struct_rwlock, RW_READER);
|
|
|
|
if (dn->dn_bonus != NULL) {
|
|
|
|
db = dn->dn_bonus;
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
}
|
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
|
|
|
dnode_rele(dn, FTAG);
|
|
|
|
}
|
|
|
|
return (db);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Insert an entry into the hash table. If there is already an element
|
|
|
|
* equal to elem in the hash table, then the already existing element
|
|
|
|
* will be returned and the new element will not be inserted.
|
|
|
|
* Otherwise returns NULL.
|
|
|
|
*/
|
|
|
|
static dmu_buf_impl_t *
|
|
|
|
dbuf_hash_insert(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = db->db_objset;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t obj = db->db.db_object;
|
|
|
|
int level = db->db_level;
|
2022-12-14 04:29:21 +03:00
|
|
|
uint64_t blkid, idx;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *dbf;
|
2018-01-29 21:24:52 +03:00
|
|
|
uint32_t i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
blkid = db->db_blkid;
|
2022-12-14 04:29:21 +03:00
|
|
|
ASSERT3U(dbuf_hash(os, obj, level, blkid), ==, db->db_hash);
|
|
|
|
idx = db->db_hash & h->hash_table_mask;
|
2010-08-26 20:52:39 +04:00
|
|
|
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_enter(DBUF_HASH_MUTEX(h, idx));
|
2018-01-29 21:24:52 +03:00
|
|
|
for (dbf = h->hash_table[idx], i = 0; dbf != NULL;
|
|
|
|
dbf = dbf->db_hash_next, i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (DBUF_EQUAL(dbf, os, obj, level, blkid)) {
|
|
|
|
mutex_enter(&dbf->db_mtx);
|
|
|
|
if (dbf->db_state != DB_EVICTING) {
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_exit(DBUF_HASH_MUTEX(h, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
return (dbf);
|
|
|
|
}
|
|
|
|
mutex_exit(&dbf->db_mtx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-01-29 21:24:52 +03:00
|
|
|
if (i > 0) {
|
|
|
|
DBUF_STAT_BUMP(hash_collisions);
|
|
|
|
if (i == 1)
|
|
|
|
DBUF_STAT_BUMP(hash_chains);
|
|
|
|
|
|
|
|
DBUF_STAT_MAX(hash_chain_max, i);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
db->db_hash_next = h->hash_table[idx];
|
|
|
|
h->hash_table[idx] = db;
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_exit(DBUF_HASH_MUTEX(h, idx));
|
2021-06-17 03:19:34 +03:00
|
|
|
uint64_t he = atomic_inc_64_nv(&dbuf_stats.hash_elements.value.ui64);
|
|
|
|
DBUF_STAT_MAX(hash_elements_max, he);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
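The insert-or-return-existing contract of dbuf_hash_insert() can be illustrated on a much smaller table. This is a simplified sketch under stated assumptions: the `sketch_` types are hypothetical, a single integer key stands in for the (objset, object, level, blkid) tuple, the key itself is used as the hash, and all locking, the DB_EVICTING check, and the statistics are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_BUCKETS	8	/* hypothetical size; must be a power of two */

typedef struct sketch_node {
	uint64_t key;			/* stands in for (os, obj, level, blkid) */
	struct sketch_node *next;	/* bucket chain link */
} sketch_node_t;

typedef struct sketch_table {
	sketch_node_t *bucket[SKETCH_BUCKETS];
} sketch_table_t;

/* Return the node with this key, or NULL if it is not in the table. */
static sketch_node_t *
sketch_find(const sketch_table_t *t, uint64_t key)
{
	sketch_node_t *n;

	for (n = t->bucket[key & (SKETCH_BUCKETS - 1)]; n != NULL;
	    n = n->next) {
		if (n->key == key)
			return (n);
	}
	return (NULL);
}

/*
 * Insert n at the head of its bucket. If an equal key is already present,
 * return that node and leave the table unchanged; otherwise return NULL.
 */
static sketch_node_t *
sketch_insert(sketch_table_t *t, sketch_node_t *n)
{
	uint64_t idx = n->key & (SKETCH_BUCKETS - 1);
	sketch_node_t *existing = sketch_find(t, n->key);

	assert(n->next == NULL);	/* n must not already be on a chain */
	if (existing != NULL)
		return (existing);

	n->next = t->bucket[idx];
	t->bucket[idx] = n;
	return (NULL);
}
```

Returning the existing element instead of an error lets a caller that lost a race to create the same buffer simply adopt the winner's copy.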
|
|
|
|
|
2018-07-10 20:49:50 +03:00
|
|
|
/*
|
|
|
|
* This returns whether this dbuf should be stored in the metadata cache, which
|
|
|
|
* is based on whether it's from one of the dnode types that store data related
|
|
|
|
* to traversing dataset hierarchies.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dmu_object_type_t type = DB_DNODE(db)->dn_type;
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
|
|
|
/* Check if this dbuf is one of the types we care about */
|
|
|
|
if (DMU_OT_IS_METADATA_CACHED(type)) {
|
|
|
|
/* If we hit this, then we set something up wrong in dmu_ot */
|
|
|
|
ASSERT(DMU_OT_IS_METADATA(type));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check for small-memory systems: don't allocate too
|
|
|
|
* much memory for this purpose.
|
|
|
|
*/
|
2018-10-01 20:42:05 +03:00
|
|
|
if (zfs_refcount_count(
|
|
|
|
&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
|
2020-07-25 06:38:48 +03:00
|
|
|
dbuf_metadata_cache_target_bytes()) {
|
2018-07-10 20:49:50 +03:00
|
|
|
DBUF_STAT_BUMP(metadata_cache_overflow);
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2014-07-15 11:43:18 +04:00
|
|
|
* Remove an entry from the hash table. It must be in the EVICTING state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_hash_remove(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
2022-12-14 04:29:21 +03:00
|
|
|
uint64_t idx;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *dbf, **dbp;
|
|
|
|
|
2022-12-14 04:29:21 +03:00
|
|
|
ASSERT3U(dbuf_hash(db->db_objset, db->db.db_object, db->db_level,
|
|
|
|
db->db_blkid), ==, db->db_hash);
|
|
|
|
idx = db->db_hash & h->hash_table_mask;
|
2010-08-26 20:52:39 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2017-01-03 20:31:18 +03:00
|
|
|
* We mustn't hold db_mtx to maintain lock ordering:
|
2022-09-19 21:07:15 +03:00
|
|
|
* DBUF_HASH_MUTEX > db_mtx.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_state == DB_EVICTING);
|
|
|
|
ASSERT(!MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_enter(DBUF_HASH_MUTEX(h, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
dbp = &h->hash_table[idx];
|
|
|
|
while ((dbf = *dbp) != db) {
|
|
|
|
dbp = &dbf->db_hash_next;
|
|
|
|
ASSERT(dbf != NULL);
|
|
|
|
}
|
|
|
|
*dbp = db->db_hash_next;
|
|
|
|
db->db_hash_next = NULL;
|
2018-01-29 21:24:52 +03:00
|
|
|
if (h->hash_table[idx] &&
|
|
|
|
h->hash_table[idx]->db_hash_next == NULL)
|
|
|
|
DBUF_STAT_BUMPDOWN(hash_chains);
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_exit(DBUF_HASH_MUTEX(h, idx));
|
2021-06-17 03:19:34 +03:00
|
|
|
atomic_dec_64(&dbuf_stats.hash_elements.value.ui64);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
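The unlink loop in dbuf_hash_remove() walks the chain with a pointer to the link rather than a pointer to the node, so removing the head and removing an interior element take the same code path. A standalone sketch of that technique follows. The `sketch_` names are hypothetical and the hash mutex and statistics are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_elem {
	int id;
	struct sketch_elem *next;
} sketch_elem_t;

/* Unlink target from the chain rooted at *headp; target must be present. */
static void
sketch_unlink(sketch_elem_t **headp, sketch_elem_t *target)
{
	sketch_elem_t **pp = headp;

	/* Advance pp until it is the link that points at target. */
	while (*pp != target) {
		assert(*pp != NULL);	/* target must be on the chain */
		pp = &(*pp)->next;
	}
	*pp = target->next;	/* splice target out */
	target->next = NULL;
}
```

Because `pp` starts at the bucket head itself, no special case is needed when the target is the first element.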
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
typedef enum {
|
|
|
|
DBVU_EVICTING,
|
|
|
|
DBVU_NOT_EVICTING
|
|
|
|
} dbvu_verify_type_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_verify_user(dmu_buf_impl_t *db, dbvu_verify_type_t verify_type)
|
|
|
|
{
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
int64_t holds;
|
|
|
|
|
|
|
|
if (db->db_user == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Only data blocks support the attachment of user data. */
|
|
|
|
ASSERT(db->db_level == 0);
|
|
|
|
|
|
|
|
/* Clients must resolve a dbuf before attaching user data. */
|
|
|
|
ASSERT(db->db.db_data != NULL);
|
|
|
|
ASSERT3U(db->db_state, ==, DB_CACHED);
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
holds = zfs_refcount_count(&db->db_holds);
|
2015-04-02 06:44:32 +03:00
|
|
|
if (verify_type == DBVU_EVICTING) {
|
|
|
|
/*
|
|
|
|
* Immediate eviction occurs when holds == dirtycnt.
|
|
|
|
* For normal eviction buffers, holds is zero on
|
|
|
|
* eviction, except when dbuf_fix_old_data() calls
|
|
|
|
* dbuf_clear_data(). However, the hold count can grow
|
|
|
|
* during eviction even though db_mtx is held (see
|
|
|
|
* dmu_bonus_hold() for an example), so we can only
|
|
|
|
* test the generic invariant that holds >= dirtycnt.
|
|
|
|
*/
|
|
|
|
ASSERT3U(holds, >=, db->db_dirtycnt);
|
|
|
|
} else {
|
2015-10-14 00:09:45 +03:00
|
|
|
if (db->db_user_immediate_evict == TRUE)
|
2015-04-02 06:44:32 +03:00
|
|
|
ASSERT3U(holds, >=, db->db_dirtycnt);
|
|
|
|
else
|
|
|
|
ASSERT3U(holds, >, 0);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dbuf_evict_user(dmu_buf_impl_t *db)
|
|
|
|
{
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_user_t *dbu = db->db_user;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
if (dbu == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
dbuf_verify_user(db, DBVU_EVICTING);
|
|
|
|
db->db_user = NULL;
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
if (dbu->dbu_clear_on_evict_dbufp != NULL)
|
|
|
|
*dbu->dbu_clear_on_evict_dbufp = NULL;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2017-01-27 01:43:28 +03:00
|
|
|
* There are two eviction callbacks - one that we call synchronously
|
|
|
|
* and one that we invoke via a taskq. The async one is useful for
|
|
|
|
* avoiding lock order reversals and limiting stack depth.
|
|
|
|
*
|
|
|
|
* Note that if we have a sync callback but no async callback,
|
|
|
|
* it's likely that the sync callback will free the structure
|
|
|
|
* containing the dbu. In that case we need to take care to not
|
|
|
|
* dereference dbu after calling the sync evict func.
|
2015-04-02 06:44:32 +03:00
|
|
|
*/
|
2017-04-12 00:56:54 +03:00
|
|
|
boolean_t has_async = (dbu->dbu_evict_func_async != NULL);
|
2017-01-27 01:43:28 +03:00
|
|
|
|
|
|
|
if (dbu->dbu_evict_func_sync != NULL)
|
|
|
|
dbu->dbu_evict_func_sync(dbu);
|
|
|
|
|
|
|
|
if (has_async) {
|
|
|
|
taskq_dispatch_ent(dbu_evict_taskq, dbu->dbu_evict_func_async,
|
|
|
|
dbu, 0, &dbu->dbu_tqent);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
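The ordering concern in dbuf_evict_user() is that a sync callback may free the structure holding the callbacks, so anything needed afterwards has to be read first. The sketch below illustrates that pattern in isolation. The `sketch_` types and counters are hypothetical, and the "async" callback is called inline as a stand-in for a taskq dispatch.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct sketch_user sketch_user_t;
typedef void (*sketch_evict_fn)(sketch_user_t *);

struct sketch_user {
	sketch_evict_fn evict_sync;
	sketch_evict_fn evict_async;
};

static int sketch_sync_calls;
static int sketch_async_calls;

/* A sync callback that frees its argument, the case the comment warns of. */
static void
sketch_sync_free(sketch_user_t *u)
{
	sketch_sync_calls++;
	free(u);
}

static void
sketch_async_note(sketch_user_t *u)
{
	(void) u;
	sketch_async_calls++;
}

static void
sketch_evict_user(sketch_user_t *u)
{
	assert(u != NULL);

	/* Read the async pointer before the sync callback can free u. */
	sketch_evict_fn async = u->evict_async;

	if (u->evict_sync != NULL)
		u->evict_sync(u);

	/* Stand-in for a taskq dispatch: the call is made inline here. */
	if (async != NULL)
		async(u);
}
```

When only a sync callback is registered, `u` is never touched after that callback returns, which is what makes it safe for the callback to free it.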
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
boolean_t
|
|
|
|
dbuf_is_metadata(dmu_buf_impl_t *db)
|
|
|
|
{
|
2014-05-02 23:26:47 +04:00
|
|
|
/*
|
|
|
|
* Consider indirect blocks and spill blocks to be metadata.
|
|
|
|
*/
|
|
|
|
if (db->db_level > 0 || db->db_blkid == DMU_SPILL_BLKID) {
|
2010-08-27 01:24:34 +04:00
|
|
|
return (B_TRUE);
|
|
|
|
} else {
|
|
|
|
boolean_t is_metadata;
|
|
|
|
|
|
|
|
DB_DNODE_ENTER(db);
|
2012-12-14 03:24:15 +04:00
|
|
|
is_metadata = DMU_OT_IS_METADATA(DB_DNODE(db)->dn_type);
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
|
|
|
return (is_metadata);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-11-11 23:52:16 +03:00
|
|
|
/*
|
|
|
|
* We want to exclude buffers that are on a special allocation class from
|
|
|
|
* L2ARC.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
dbuf_is_l2cacheable(dmu_buf_impl_t *db)
|
|
|
|
{
|
2023-03-07 03:13:05 +03:00
|
|
|
if (db->db_objset->os_secondary_cache == ZFS_CACHE_ALL ||
|
|
|
|
(db->db_objset->os_secondary_cache ==
|
|
|
|
ZFS_CACHE_METADATA && dbuf_is_metadata(db))) {
|
|
|
|
if (l2arc_exclude_special == 0)
|
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
blkptr_t *bp = db->db_blkptr;
|
|
|
|
if (bp == NULL || BP_IS_HOLE(bp))
|
|
|
|
return (B_FALSE);
|
2021-11-11 23:52:16 +03:00
|
|
|
uint64_t vdev = DVA_GET_VDEV(bp->blk_dva);
|
|
|
|
vdev_t *rvd = db->db_objset->os_spa->spa_root_vdev;
|
2023-03-07 03:13:05 +03:00
|
|
|
vdev_t *vd = NULL;
|
2021-11-11 23:52:16 +03:00
|
|
|
|
|
|
|
if (vdev < rvd->vdev_children)
|
|
|
|
vd = rvd->vdev_child[vdev];
|
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd == NULL)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd->vdev_alloc_bias != VDEV_BIAS_SPECIAL &&
|
|
|
|
vd->vdev_alloc_bias != VDEV_BIAS_DEDUP)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline boolean_t
|
|
|
|
dnode_level_is_l2cacheable(blkptr_t *bp, dnode_t *dn, int64_t level)
|
|
|
|
{
|
2023-03-07 03:13:05 +03:00
|
|
|
if (dn->dn_objset->os_secondary_cache == ZFS_CACHE_ALL ||
|
|
|
|
(dn->dn_objset->os_secondary_cache == ZFS_CACHE_METADATA &&
|
|
|
|
(level > 0 ||
|
|
|
|
DMU_OT_IS_METADATA(dn->dn_handle->dnh_dnode->dn_type)))) {
|
|
|
|
if (l2arc_exclude_special == 0)
|
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
if (bp == NULL || BP_IS_HOLE(bp))
|
|
|
|
return (B_FALSE);
|
2021-11-11 23:52:16 +03:00
|
|
|
uint64_t vdev = DVA_GET_VDEV(bp->blk_dva);
|
|
|
|
vdev_t *rvd = dn->dn_objset->os_spa->spa_root_vdev;
|
2023-03-07 03:13:05 +03:00
|
|
|
vdev_t *vd = NULL;
|
2021-11-11 23:52:16 +03:00
|
|
|
|
|
|
|
if (vdev < rvd->vdev_children)
|
|
|
|
vd = rvd->vdev_child[vdev];
|
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd == NULL)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd->vdev_alloc_bias != VDEV_BIAS_SPECIAL &&
|
|
|
|
vd->vdev_alloc_bias != VDEV_BIAS_DEDUP)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This function *must* return indices evenly distributed between all
|
|
|
|
* sublists of the multilist. This is needed due to how the dbuf eviction
|
|
|
|
* code is laid out; dbuf_evict_thread() assumes dbufs are evenly
|
|
|
|
* distributed between all sublists and uses this assumption when
|
|
|
|
* deciding which sublist to evict from and how much to evict from it.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static unsigned int
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_cache_multilist_index_func(multilist_t *ml, void *obj)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
dmu_buf_impl_t *db = obj;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The assumption here is that the hash value for a given
|
|
|
|
* dmu_buf_impl_t will remain constant throughout its lifetime
|
|
|
|
* (i.e. its objset, object, level and blkid fields don't change).
|
|
|
|
* Thus, we don't need to store the dbuf's sublist index
|
|
|
|
* on insertion, as this index can be recalculated on removal.
|
|
|
|
*
|
|
|
|
* Also, the low order bits of the hash value are thought to be
|
|
|
|
* distributed evenly. Otherwise, in the case that the multilist
|
|
|
|
* has a power of two number of sublists, each sublist's usage
|
2021-06-29 15:59:14 +03:00
|
|
|
* would not be evenly distributed. In this context full 64-bit
|
|
|
|
* division would be a waste of time, so limit it to 32 bits.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2021-06-29 15:59:14 +03:00
|
|
|
return ((unsigned int)dbuf_hash(db->db_objset, db->db.db_object,
|
2016-06-02 07:04:53 +03:00
|
|
|
db->db_level, db->db_blkid) %
|
|
|
|
multilist_get_num_sublists(ml));
|
|
|
|
}
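The index computation above truncates the 64-bit hash to `unsigned int` before taking the remainder, so only the low-order bits decide the sublist. A minimal sketch, assuming a 32-bit `unsigned int` and a hypothetical `sketch_sublist_index()` that takes the hash and sublist count directly:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 64-bit hash to a sublist index. The cast keeps only the low-order
 * bits (32 on platforms where unsigned int is 32 bits wide), so the
 * remainder is a 32-bit operation rather than a 64-bit one.
 */
static unsigned int
sketch_sublist_index(uint64_t hash, unsigned int num_sublists)
{
	assert(num_sublists != 0);
	return ((unsigned int)hash % num_sublists);
}
```

Since the result depends only on the hash, the same buffer always maps to the same sublist, which is why the index does not need to be stored at insertion time.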
|
|
|
|
|
2020-07-25 06:38:48 +03:00
|
|
|
/*
|
|
|
|
* The target size of the dbuf cache can grow with the ARC target,
|
|
|
|
* unless limited by the tunable dbuf_cache_max_bytes.
|
|
|
|
*/
|
2017-09-30 01:49:19 +03:00
|
|
|
static inline unsigned long
|
|
|
|
dbuf_cache_target_bytes(void)
|
|
|
|
{
|
2020-07-25 06:38:48 +03:00
|
|
|
return (MIN(dbuf_cache_max_bytes,
|
|
|
|
arc_target_bytes() >> dbuf_cache_shift));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The target size of the dbuf metadata cache can grow with the ARC target,
|
|
|
|
* unless limited by the tunable dbuf_metadata_cache_max_bytes.
|
|
|
|
*/
|
|
|
|
static inline unsigned long
|
|
|
|
dbuf_metadata_cache_target_bytes(void)
|
|
|
|
{
|
|
|
|
return (MIN(dbuf_metadata_cache_max_bytes,
|
|
|
|
arc_target_bytes() >> dbuf_metadata_cache_shift));
|
2017-09-30 01:49:19 +03:00
|
|
|
}
|
|
|
|
|
2018-01-29 21:24:52 +03:00
|
|
|
static inline uint64_t
|
|
|
|
dbuf_cache_hiwater_bytes(void)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
2017-09-30 01:49:19 +03:00
|
|
|
uint64_t dbuf_cache_target = dbuf_cache_target_bytes();
|
2018-01-29 21:24:52 +03:00
|
|
|
return (dbuf_cache_target +
|
|
|
|
(dbuf_cache_target * dbuf_cache_hiwater_pct) / 100);
|
|
|
|
}
|
2017-09-30 01:49:19 +03:00
|
|
|
|
2018-01-29 21:24:52 +03:00
|
|
|
static inline uint64_t
|
|
|
|
dbuf_cache_lowater_bytes(void)
|
|
|
|
{
|
|
|
|
uint64_t dbuf_cache_target = dbuf_cache_target_bytes();
|
|
|
|
return (dbuf_cache_target -
|
|
|
|
(dbuf_cache_target * dbuf_cache_lowater_pct) / 100);
|
|
|
|
}
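The target and water-mark arithmetic in the helpers above can be worked through with concrete numbers. This is a standalone sketch, not the file's code: the `sketch_` names are hypothetical, the ARC target is passed in as a parameter instead of coming from arc_target_bytes(), and the constants simply mirror the defaults shown earlier (shift of 5, 10% margins).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the tunables; values mirror the defaults. */
static const unsigned sketch_cache_shift = 5;	/* 1/32 of the ARC target */
static const unsigned sketch_hiwater_pct = 10;
static const unsigned sketch_lowater_pct = 10;

/* Target size: the smaller of the byte cap and arc_target >> shift. */
static uint64_t
sketch_target_bytes(uint64_t max_bytes, uint64_t arc_target)
{
	uint64_t scaled = arc_target >> sketch_cache_shift;

	return (max_bytes < scaled ? max_bytes : scaled);
}

/* High water mark: target plus hiwater_pct percent of the target. */
static uint64_t
sketch_hiwater_bytes(uint64_t target)
{
	return (target + (target * sketch_hiwater_pct) / 100);
}

/* Low water mark: target minus lowater_pct percent of the target. */
static uint64_t
sketch_lowater_bytes(uint64_t target)
{
	assert(sketch_lowater_pct <= 100);
	return (target - (target * sketch_lowater_pct) / 100);
}
```

For a 4 GiB ARC target and no byte cap, the target is 4 GiB >> 5 = 128 MiB (134217728 bytes), giving a low water mark of 120795956 bytes and a high water mark of 147639500 bytes after integer division.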
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
static inline boolean_t
|
|
|
|
dbuf_cache_above_lowater(void)
|
|
|
|
{
|
2018-10-01 20:42:05 +03:00
|
|
|
return (zfs_refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
|
2018-07-10 20:49:50 +03:00
|
|
|
dbuf_cache_lowater_bytes());
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Evict the oldest eligible dbuf from the dbuf cache.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_evict_one(void)
|
|
|
|
{
|
2021-06-10 19:42:31 +03:00
|
|
|
int idx = multilist_get_random_index(&dbuf_caches[DB_DBUF_CACHE].cache);
|
2018-07-10 20:49:50 +03:00
|
|
|
multilist_sublist_t *mls = multilist_sublist_lock(
|
2021-06-10 19:42:31 +03:00
|
|
|
&dbuf_caches[DB_DBUF_CACHE].cache, idx);
|
2017-11-04 23:25:13 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!MUTEX_HELD(&dbuf_evict_lock));
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
dmu_buf_impl_t *db = multilist_sublist_tail(mls);
|
2016-06-02 07:04:53 +03:00
|
|
|
while (db != NULL && mutex_tryenter(&db->db_mtx) == 0) {
|
|
|
|
db = multilist_sublist_prev(mls, db);
|
|
|
|
}
|
|
|
|
|
|
|
|
DTRACE_PROBE2(dbuf__evict__one, dmu_buf_impl_t *, db,
|
|
|
|
multilist_sublist_t *, mls);
|
|
|
|
|
|
|
|
if (db != NULL) {
|
|
|
|
multilist_sublist_remove(mls, db);
|
|
|
|
multilist_sublist_unlock(mls);
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
|
|
|
&dbuf_caches[DB_DBUF_CACHE].size, db->db.db_size, db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMPDOWN(cache_levels[db->db_level]);
|
|
|
|
DBUF_STAT_BUMPDOWN(cache_count);
|
|
|
|
DBUF_STAT_DECR(cache_levels_bytes[db->db_level],
|
|
|
|
db->db.db_size);
|
2018-07-10 20:49:50 +03:00
|
|
|
ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
|
|
|
|
db->db_caching_status = DB_NO_CACHE;
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_destroy(db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMP(cache_total_evicts);
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
|
|
|
multilist_sublist_unlock(mls);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The dbuf evict thread is responsible for aging out dbufs from the
|
|
|
|
* cache. Once the cache has reached its maximum size, dbufs are removed
|
|
|
|
* and destroyed. The eviction thread will continue running until the size
|
|
|
|
* of the dbuf cache is at or below the maximum size. Once the dbuf is aged
|
|
|
|
* out of the cache it is destroyed and becomes eligible for arc eviction.
|
|
|
|
*/
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
Simplify threads, mutexes, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away with this
before since the functions were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD(), which provides the same information
and hides the implementation details, in this case the use of
the pthread_equal() function.
* The kthread_t, kmutex_t, krwlock_t, and kcondvar_t types had
any nonessential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers helps readability.
* Added a TS_JOINABLE state flag to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
dbuf_evict_thread(void *unused)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
2016-06-02 07:04:53 +03:00
|
|
|
callb_cpr_t cpr;
|
|
|
|
|
|
|
|
CALLB_CPR_INIT(&cpr, &dbuf_evict_lock, callb_generic_cpr, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&dbuf_evict_lock);
|
|
|
|
while (!dbuf_evict_thread_exit) {
|
|
|
|
while (!dbuf_cache_above_lowater() && !dbuf_evict_thread_exit) {
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2020-09-04 06:04:09 +03:00
|
|
|
(void) cv_timedwait_idle_hires(&dbuf_evict_cv,
|
2016-06-02 07:04:53 +03:00
|
|
|
&dbuf_evict_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
|
|
|
|
CALLB_CPR_SAFE_END(&cpr, &dbuf_evict_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&dbuf_evict_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Keep evicting as long as we're above the low water mark
|
|
|
|
* for the cache. We do this without holding the locks to
|
|
|
|
* minimize lock contention.
|
|
|
|
*/
|
|
|
|
while (dbuf_cache_above_lowater() && !dbuf_evict_thread_exit) {
|
|
|
|
dbuf_evict_one();
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&dbuf_evict_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
dbuf_evict_thread_exit = B_FALSE;
|
|
|
|
cv_broadcast(&dbuf_evict_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops dbuf_evict_lock */
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wake up the dbuf eviction thread if the dbuf cache is at its max size.
|
|
|
|
* If the dbuf cache is at its high water mark, then evict a dbuf from the
|
2022-01-21 19:07:15 +03:00
|
|
|
* dbuf cache using the caller's context.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
|
|
|
static void
|
2020-02-05 22:08:44 +03:00
|
|
|
dbuf_evict_notify(uint64_t size)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
2017-03-29 01:31:49 +03:00
|
|
|
/*
|
|
|
|
* We check if we should evict without holding the dbuf_evict_lock,
|
|
|
|
* because it's OK to occasionally make the wrong decision here,
|
|
|
|
* and grabbing the lock results in massive lock contention.
|
|
|
|
*/
|
2020-02-05 22:08:44 +03:00
|
|
|
if (size > dbuf_cache_target_bytes()) {
|
|
|
|
if (size > dbuf_cache_hiwater_bytes())
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_evict_one();
|
2017-03-29 01:31:49 +03:00
|
|
|
cv_signal(&dbuf_evict_cv);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
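/*
 * Illustrative sketch, not part of dbuf.c: the unlocked threshold check in
 * dbuf_evict_notify() above boils down to a two-level decision. The type
 * evict_decision_t and the function evict_decide() are hypothetical names,
 * and target/hiwater are passed in as plain numbers instead of coming from
 * dbuf_cache_target_bytes() and dbuf_cache_hiwater_bytes(). The sketch
 * assumes hiwater >= target.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type used only by this sketch. */
typedef struct evict_decision {
	bool evict_inline;	/* caller evicts one dbuf in its own context */
	bool wake_thread;	/* caller signals the eviction thread */
} evict_decision_t;

/*
 * Decide what a caller should do for a given cache size. No lock is
 * taken: a stale size only leads to an occasional wrong decision.
 */
static evict_decision_t
evict_decide(uint64_t size, uint64_t target, uint64_t hiwater)
{
	evict_decision_t d = { false, false };

	assert(hiwater >= target);	/* assumption of this sketch */
	if (size > target) {
		/* Above the high water mark the caller helps out directly. */
		d.evict_inline = (size > hiwater);
		/* Above the target the background thread is always woken. */
		d.wake_thread = true;
	}
	return (d);
}
```

/*
 * Keeping the decision free of locks trades a little accuracy for the
 * absence of contention on dbuf_evict_lock, as the comment in
 * dbuf_evict_notify() explains.
 */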
|
|
|
|
|
2018-01-29 21:24:52 +03:00
|
|
|
static int
|
|
|
|
dbuf_kstat_update(kstat_t *ksp, int rw)
|
|
|
|
{
|
|
|
|
dbuf_stats_t *ds = ksp->ks_data;
|
2022-09-19 22:17:11 +03:00
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
if (rw == KSTAT_WRITE)
|
2018-01-29 21:24:52 +03:00
|
|
|
return (SET_ERROR(EACCES));
|
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
ds->cache_count.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.cache_count);
|
|
|
|
ds->cache_size_bytes.value.ui64 =
|
|
|
|
zfs_refcount_count(&dbuf_caches[DB_DBUF_CACHE].size);
|
|
|
|
ds->cache_target_bytes.value.ui64 = dbuf_cache_target_bytes();
|
|
|
|
ds->cache_hiwater_bytes.value.ui64 = dbuf_cache_hiwater_bytes();
|
|
|
|
ds->cache_lowater_bytes.value.ui64 = dbuf_cache_lowater_bytes();
|
|
|
|
ds->cache_total_evicts.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.cache_total_evicts);
|
|
|
|
for (int i = 0; i < DN_MAX_LEVELS; i++) {
|
|
|
|
ds->cache_levels[i].value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.cache_levels[i]);
|
|
|
|
ds->cache_levels_bytes[i].value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.cache_levels_bytes[i]);
|
|
|
|
}
|
|
|
|
ds->hash_hits.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.hash_hits);
|
|
|
|
ds->hash_misses.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.hash_misses);
|
|
|
|
ds->hash_collisions.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.hash_collisions);
|
|
|
|
ds->hash_chains.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.hash_chains);
|
|
|
|
ds->hash_insert_race.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.hash_insert_race);
|
2022-09-19 22:17:11 +03:00
|
|
|
ds->hash_table_count.value.ui64 = h->hash_table_mask + 1;
|
|
|
|
ds->hash_mutex_count.value.ui64 = h->hash_mutex_mask + 1;
|
2021-06-17 03:19:34 +03:00
|
|
|
ds->metadata_cache_count.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.metadata_cache_count);
|
|
|
|
ds->metadata_cache_size_bytes.value.ui64 = zfs_refcount_count(
|
|
|
|
&dbuf_caches[DB_DBUF_METADATA_CACHE].size);
|
|
|
|
ds->metadata_cache_overflow.value.ui64 =
|
|
|
|
wmsum_value(&dbuf_sums.metadata_cache_overflow);
|
2018-01-29 21:24:52 +03:00
|
|
|
return (0);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
dbuf_init(void)
|
|
|
|
{
|
2022-09-19 22:17:11 +03:00
|
|
|
uint64_t hmsize, hsize = 1ULL << 16;
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
|
|
|
|
|
|
|
/*
|
2021-07-01 18:30:31 +03:00
|
|
|
* The hash table is big enough to fill one eighth of physical memory
|
2015-08-31 04:59:23 +03:00
|
|
|
* with an average block size of zfs_arc_average_blocksize (default 8K).
|
|
|
|
* By default, the table will take up
|
|
|
|
* totalmem * sizeof(void*) / 64K (128KB per GB with 8-byte pointers).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-07-01 18:30:31 +03:00
|
|
|
while (hsize * zfs_arc_average_blocksize < arc_all_memory() / 8)
|
2008-11-20 23:01:55 +03:00
|
|
|
hsize <<= 1;
|
|
|
|
|
2022-09-19 22:17:11 +03:00
|
|
|
h->hash_table = NULL;
|
|
|
|
while (h->hash_table == NULL) {
|
|
|
|
h->hash_table_mask = hsize - 1;
|
|
|
|
|
|
|
|
h->hash_table = vmem_zalloc(hsize * sizeof (void *), KM_SLEEP);
|
|
|
|
if (h->hash_table == NULL)
|
|
|
|
hsize >>= 1;
|
|
|
|
|
|
|
|
ASSERT3U(hsize, >=, 1ULL << 10);
|
|
|
|
}
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
2022-09-19 22:17:11 +03:00
|
|
|
* The hash table buckets are protected by an array of mutexes where
|
|
|
|
* each mutex is responsible for protecting 128 buckets. A minimum
|
|
|
|
* array size of 8192 is targeted to avoid contention.
|
2013-11-01 23:26:11 +04:00
|
|
|
*/
|
2022-09-19 22:17:11 +03:00
|
|
|
if (dbuf_mutex_cache_shift == 0)
|
|
|
|
hmsize = MAX(hsize >> 7, 1ULL << 13);
|
|
|
|
else
|
|
|
|
hmsize = 1ULL << MIN(dbuf_mutex_cache_shift, 24);
|
|
|
|
|
|
|
|
h->hash_mutexes = NULL;
|
|
|
|
while (h->hash_mutexes == NULL) {
|
|
|
|
h->hash_mutex_mask = hmsize - 1;
|
|
|
|
|
|
|
|
h->hash_mutexes = vmem_zalloc(hmsize * sizeof (kmutex_t),
|
|
|
|
KM_SLEEP);
|
|
|
|
if (h->hash_mutexes == NULL)
|
|
|
|
hmsize >>= 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_kmem_cache = kmem_cache_create("dmu_buf_impl_t",
|
2008-11-20 23:01:55 +03:00
|
|
|
sizeof (dmu_buf_impl_t),
|
|
|
|
0, dbuf_cons, dbuf_dest, NULL, NULL, NULL, 0);
|
|
|
|
|
2022-09-19 22:17:11 +03:00
|
|
|
for (int i = 0; i < hmsize; i++)
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);
|
2013-10-03 04:11:19 +04:00
|
|
|
|
|
|
|
dbuf_stats_init(h);
|
2015-04-02 06:44:32 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* All entries are queued via taskq_dispatch_ent(), so min/maxalloc
|
|
|
|
* configuration is not required.
|
|
|
|
*/
|
2015-07-24 20:08:31 +03:00
|
|
|
dbu_evict_taskq = taskq_create("dbu_evict", 1, defclsyspri, 0, 0, 0);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2018-07-10 20:49:50 +03:00
|
|
|
for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_create(&dbuf_caches[dcs].cache,
|
|
|
|
sizeof (dmu_buf_impl_t),
|
2018-07-10 20:49:50 +03:00
|
|
|
offsetof(dmu_buf_impl_t, db_cache_link),
|
|
|
|
dbuf_cache_multilist_index_func);
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&dbuf_caches[dcs].size);
|
2018-07-10 20:49:50 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
dbuf_evict_thread_exit = B_FALSE;
|
|
|
|
mutex_init(&dbuf_evict_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&dbuf_evict_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
dbuf_cache_evict_thread = thread_create(NULL, 0, dbuf_evict_thread,
|
|
|
|
NULL, 0, &p0, TS_RUN, minclsyspri);
|
2018-01-29 21:24:52 +03:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&dbuf_sums.cache_count, 0);
|
|
|
|
wmsum_init(&dbuf_sums.cache_total_evicts, 0);
|
2022-09-19 22:17:11 +03:00
|
|
|
for (int i = 0; i < DN_MAX_LEVELS; i++) {
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&dbuf_sums.cache_levels[i], 0);
|
|
|
|
wmsum_init(&dbuf_sums.cache_levels_bytes[i], 0);
|
|
|
|
}
|
|
|
|
wmsum_init(&dbuf_sums.hash_hits, 0);
|
|
|
|
wmsum_init(&dbuf_sums.hash_misses, 0);
|
|
|
|
wmsum_init(&dbuf_sums.hash_collisions, 0);
|
|
|
|
wmsum_init(&dbuf_sums.hash_chains, 0);
|
|
|
|
wmsum_init(&dbuf_sums.hash_insert_race, 0);
|
|
|
|
wmsum_init(&dbuf_sums.metadata_cache_count, 0);
|
|
|
|
wmsum_init(&dbuf_sums.metadata_cache_overflow, 0);
|
|
|
|
|
2018-01-29 21:24:52 +03:00
|
|
|
dbuf_ksp = kstat_create("zfs", 0, "dbufstats", "misc",
|
|
|
|
KSTAT_TYPE_NAMED, sizeof (dbuf_stats) / sizeof (kstat_named_t),
|
|
|
|
KSTAT_FLAG_VIRTUAL);
|
|
|
|
if (dbuf_ksp != NULL) {
|
2022-09-19 22:17:11 +03:00
|
|
|
for (int i = 0; i < DN_MAX_LEVELS; i++) {
|
2018-01-29 21:24:52 +03:00
|
|
|
snprintf(dbuf_stats.cache_levels[i].name,
|
|
|
|
KSTAT_STRLEN, "cache_level_%d", i);
|
|
|
|
dbuf_stats.cache_levels[i].data_type =
|
|
|
|
KSTAT_DATA_UINT64;
|
|
|
|
snprintf(dbuf_stats.cache_levels_bytes[i].name,
|
|
|
|
KSTAT_STRLEN, "cache_level_%d_bytes", i);
|
|
|
|
dbuf_stats.cache_levels_bytes[i].data_type =
|
|
|
|
KSTAT_DATA_UINT64;
|
|
|
|
}
|
2020-02-04 19:49:12 +03:00
|
|
|
dbuf_ksp->ks_data = &dbuf_stats;
|
|
|
|
dbuf_ksp->ks_update = dbuf_kstat_update;
|
|
|
|
kstat_install(dbuf_ksp);
|
2018-01-29 21:24:52 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
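/*
 * Illustrative sketch, not part of dbuf.c: the sizing arithmetic used by
 * dbuf_init() above, pulled out into two user-space helpers so it can be
 * tried with different inputs. sketch_hash_size() and sketch_mutex_count()
 * are hypothetical names; all_memory and avg_blocksize stand in for
 * arc_all_memory() and zfs_arc_average_blocksize, and mutex_cache_shift
 * stands in for dbuf_mutex_cache_shift. Allocation and its fallback loop
 * are left out.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest power of two, starting at 2^16, such that
 * hsize * avg_blocksize covers one eighth of memory.
 */
static uint64_t
sketch_hash_size(uint64_t all_memory, uint64_t avg_blocksize)
{
	uint64_t hsize = 1ULL << 16;

	assert(avg_blocksize != 0);
	while (hsize * avg_blocksize < all_memory / 8)
		hsize <<= 1;
	return (hsize);
}

/*
 * One mutex per 128 buckets with a floor of 8192 mutexes, unless an
 * explicit shift is given, which is capped at 24.
 */
static uint64_t
sketch_mutex_count(uint64_t hsize, unsigned mutex_cache_shift)
{
	if (mutex_cache_shift == 0) {
		uint64_t per128 = hsize >> 7;
		uint64_t floor = 1ULL << 13;

		return (per128 > floor ? per128 : floor);
	}
	return (1ULL << (mutex_cache_shift < 24 ? mutex_cache_shift : 24));
}
```

/*
 * For example, with 16 GiB of memory and an 8K average block size the loop
 * stops at 2^18 buckets, and the mutex array stays at its 8192-entry floor.
 */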
|
|
|
|
|
|
|
|
void
|
|
|
|
dbuf_fini(void)
|
|
|
|
{
|
|
|
|
dbuf_hash_table_t *h = &dbuf_hash_table;
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
dbuf_stats_destroy();
|
|
|
|
|
2022-09-19 22:17:11 +03:00
|
|
|
for (int i = 0; i < (h->hash_mutex_mask + 1); i++)
|
2022-09-19 21:07:15 +03:00
|
|
|
mutex_destroy(&h->hash_mutexes[i]);
|
2022-09-19 22:17:11 +03:00
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
vmem_free(h->hash_table, (h->hash_table_mask + 1) * sizeof (void *));
|
2022-09-19 22:17:11 +03:00
|
|
|
vmem_free(h->hash_mutexes, (h->hash_mutex_mask + 1) *
|
|
|
|
sizeof (kmutex_t));
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
kmem_cache_destroy(dbuf_kmem_cache);
|
2015-04-02 06:44:32 +03:00
|
|
|
taskq_destroy(dbu_evict_taskq);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
mutex_enter(&dbuf_evict_lock);
|
|
|
|
dbuf_evict_thread_exit = B_TRUE;
|
|
|
|
while (dbuf_evict_thread_exit) {
|
|
|
|
cv_signal(&dbuf_evict_cv);
|
|
|
|
cv_wait(&dbuf_evict_cv, &dbuf_evict_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&dbuf_evict_lock);
|
|
|
|
|
|
|
|
mutex_destroy(&dbuf_evict_lock);
|
|
|
|
cv_destroy(&dbuf_evict_cv);
|
|
|
|
|
2018-07-10 20:49:50 +03:00
|
|
|
for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_destroy(&dbuf_caches[dcs].size);
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_destroy(&dbuf_caches[dcs].cache);
|
2018-07-10 20:49:50 +03:00
|
|
|
}
|
2018-01-29 21:24:52 +03:00
|
|
|
|
|
|
|
if (dbuf_ksp != NULL) {
|
|
|
|
kstat_delete(dbuf_ksp);
|
|
|
|
dbuf_ksp = NULL;
|
|
|
|
}
|
2021-06-17 03:19:34 +03:00
|
|
|
|
|
|
|
wmsum_fini(&dbuf_sums.cache_count);
|
|
|
|
wmsum_fini(&dbuf_sums.cache_total_evicts);
|
2022-09-19 22:17:11 +03:00
|
|
|
for (int i = 0; i < DN_MAX_LEVELS; i++) {
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&dbuf_sums.cache_levels[i]);
|
|
|
|
wmsum_fini(&dbuf_sums.cache_levels_bytes[i]);
|
|
|
|
}
|
|
|
|
wmsum_fini(&dbuf_sums.hash_hits);
|
|
|
|
wmsum_fini(&dbuf_sums.hash_misses);
|
|
|
|
wmsum_fini(&dbuf_sums.hash_collisions);
|
|
|
|
wmsum_fini(&dbuf_sums.hash_chains);
|
|
|
|
wmsum_fini(&dbuf_sums.hash_insert_race);
|
|
|
|
wmsum_fini(&dbuf_sums.metadata_cache_count);
|
|
|
|
wmsum_fini(&dbuf_sums.metadata_cache_overflow);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Other stuff.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
static void
|
|
|
|
dbuf_verify(dmu_buf_impl_t *db)
|
|
|
|
{
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_t *dn;
|
2010-05-29 00:45:14 +04:00
|
|
|
dbuf_dirty_record_t *dr;
|
2020-02-05 22:07:19 +03:00
|
|
|
uint32_t txg_prev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_DBUF_VERIFY))
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(db->db_objset != NULL);
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dn == NULL) {
|
|
|
|
ASSERT(db->db_parent == NULL);
|
|
|
|
ASSERT(db->db_blkptr == NULL);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(db->db.db_object, ==, dn->dn_object);
|
|
|
|
ASSERT3P(db->db_objset, ==, dn->dn_objset);
|
|
|
|
ASSERT3U(db->db_level, <, dn->dn_nlevels);
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(db->db_blkid == DMU_BONUS_BLKID ||
|
|
|
|
db->db_blkid == DMU_SPILL_BLKID ||
|
2015-04-03 06:14:28 +03:00
|
|
|
!avl_is_empty(&dn->dn_dbufs));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
|
|
|
ASSERT(dn != NULL);
|
|
|
|
ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
|
|
|
|
ASSERT3U(db->db.db_offset, ==, DMU_BONUS_BLKID);
|
|
|
|
} else if (db->db_blkid == DMU_SPILL_BLKID) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(dn != NULL);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(db->db.db_offset);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3U(db->db.db_offset, ==, db->db_blkid * db->db.db_size);
|
|
|
|
}
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
if ((dr = list_head(&db->db_dirty_records)) != NULL) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(dr->dr_dbuf == db);
|
2020-02-05 22:07:19 +03:00
|
|
|
txg_prev = dr->dr_txg;
|
|
|
|
for (dr = list_next(&db->db_dirty_records, dr); dr != NULL;
|
|
|
|
dr = list_next(&db->db_dirty_records, dr)) {
|
|
|
|
ASSERT(dr->dr_dbuf == db);
|
|
|
|
ASSERT(txg_prev > dr->dr_txg);
|
|
|
|
txg_prev = dr->dr_txg;
|
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* We can't assert that db_size matches dn_datablksz because it
|
|
|
|
* can be momentarily different when another thread is doing
|
|
|
|
* dnode_set_blksz().
|
|
|
|
*/
|
|
|
|
if (db->db_level == 0 && db->db.db_object == DMU_META_DNODE_OBJECT) {
|
2010-05-29 00:45:14 +04:00
|
|
|
dr = db->db_data_pending;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* It should only be modified in syncing context, so
|
|
|
|
* make sure we only have one copy of the data.
|
|
|
|
*/
|
|
|
|
ASSERT(dr == NULL || dr->dt.dl.dr_data == db->db_buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* verify db->db_blkptr */
|
|
|
|
if (db->db_blkptr) {
|
|
|
|
if (db->db_parent == dn->dn_dbuf) {
|
|
|
|
/* db is pointed to by the dnode */
|
|
|
|
/* ASSERT3U(db->db_blkid, <, dn->dn_nblkptr); */
|
2009-07-03 02:44:48 +04:00
|
|
|
if (DMU_OBJECT_IS_SPECIAL(db->db.db_object))
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_parent == NULL);
|
|
|
|
else
|
|
|
|
ASSERT(db->db_parent != NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid != DMU_SPILL_BLKID)
|
|
|
|
ASSERT3P(db->db_blkptr, ==,
|
|
|
|
&dn->dn_phys->dn_blkptr[db->db_blkid]);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
/* db is pointed to by an indirect block */
|
2019-12-05 23:37:00 +03:00
|
|
|
int epb __maybe_unused = db->db_parent->db.db_size >>
|
|
|
|
SPA_BLKPTRSHIFT;
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3U(db->db_parent->db_level, ==, db->db_level+1);
|
|
|
|
ASSERT3U(db->db_parent->db.db_object, ==,
|
|
|
|
db->db.db_object);
|
|
|
|
/*
|
|
|
|
* dnode_grow_indblksz() can make this fail if we don't
|
2019-07-08 23:18:50 +03:00
|
|
|
* have the parent's rwlock. XXX indblksz no longer
|
2008-11-20 23:01:55 +03:00
|
|
|
* grows. safe to do this now?
|
|
|
|
*/
|
2019-07-08 23:18:50 +03:00
|
|
|
if (RW_LOCK_HELD(&db->db_parent->db_rwlock)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3P(db->db_blkptr, ==,
|
|
|
|
((blkptr_t *)db->db_parent->db.db_data +
|
|
|
|
db->db_blkid % epb));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if ((db->db_blkptr == NULL || BP_IS_HOLE(db->db_blkptr)) &&
|
2010-05-29 00:45:14 +04:00
|
|
|
(db->db_buf == NULL || db->db_buf->b_data) &&
|
|
|
|
db->db.db_data && db->db_blkid != DMU_BONUS_BLKID &&
|
2023-03-15 01:00:54 +03:00
|
|
|
db->db_state != DB_FILL && (dn == NULL || !dn->dn_free_txg)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If the blkptr isn't set but they have nonzero data,
|
|
|
|
* it had better be dirty, otherwise we'll lose that
|
|
|
|
* data when we evict this buffer.
|
2016-05-15 18:02:28 +03:00
|
|
|
*
|
|
|
|
* There is an exception to this rule for indirect blocks; in
|
|
|
|
* this case, if the indirect block is a hole, we fill in a few
|
|
|
|
* fields on each of the child blocks (importantly, birth time)
|
|
|
|
* to prevent hole birth times from being lost when you
|
|
|
|
* partially fill in a hole.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (db->db_dirtycnt == 0) {
|
2016-05-15 18:02:28 +03:00
|
|
|
if (db->db_level == 0) {
|
|
|
|
uint64_t *buf = db->db.db_data;
|
|
|
|
int i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-05-15 18:02:28 +03:00
|
|
|
for (i = 0; i < db->db.db_size >> 3; i++) {
|
|
|
|
ASSERT(buf[i] == 0);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
blkptr_t *bps = db->db.db_data;
|
|
|
|
ASSERT3U(1 << DB_DNODE(db)->dn_indblkshift, ==,
|
|
|
|
db->db.db_size);
|
|
|
|
/*
|
|
|
|
* We want to verify that all the blkptrs in the
|
|
|
|
* indirect block are holes, but we may have
|
|
|
|
* automatically set up a few fields for them.
|
|
|
|
* We iterate through each blkptr and verify
|
|
|
|
* they only have those fields set.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0;
|
2016-05-15 18:02:28 +03:00
|
|
|
i < db->db.db_size / sizeof (blkptr_t);
|
|
|
|
i++) {
|
|
|
|
blkptr_t *bp = &bps[i];
|
|
|
|
ASSERT(ZIO_CHECKSUM_IS_ZERO(
|
|
|
|
&bp->blk_cksum));
|
|
|
|
ASSERT(
|
|
|
|
DVA_IS_EMPTY(&bp->blk_dva[0]) &&
|
|
|
|
DVA_IS_EMPTY(&bp->blk_dva[1]) &&
|
|
|
|
DVA_IS_EMPTY(&bp->blk_dva[2]));
|
|
|
|
ASSERT0(bp->blk_fill);
|
|
|
|
ASSERT0(bp->blk_pad[0]);
|
|
|
|
ASSERT0(bp->blk_pad[1]);
|
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp));
|
|
|
|
ASSERT(BP_IS_HOLE(bp));
|
|
|
|
ASSERT0(bp->blk_phys_birth);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
static void
|
|
|
|
dbuf_clear_data(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
dbuf_evict_user(db);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(db->db_buf, ==, NULL);
|
2015-04-02 06:44:32 +03:00
|
|
|
db->db.db_data = NULL;
|
2020-02-18 22:21:37 +03:00
|
|
|
if (db->db_state != DB_NOFILL) {
|
2015-04-02 06:44:32 +03:00
|
|
|
db->db_state = DB_UNCACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "clear data");
|
|
|
|
}
|
2015-04-02 06:44:32 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dbuf_set_data(dmu_buf_impl_t *db, arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2015-04-02 06:44:32 +03:00
|
|
|
ASSERT(buf != NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_buf = buf;
|
2015-04-02 06:44:32 +03:00
|
|
|
ASSERT(buf->b_data != NULL);
|
|
|
|
db->db.db_data = buf->b_data;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-02-18 22:21:37 +03:00
|
|
|
static arc_buf_t *
|
|
|
|
dbuf_alloc_arcbuf(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
spa_t *spa = db->db_objset->os_spa;
|
|
|
|
|
|
|
|
return (arc_alloc_buf(spa, db, DBUF_GET_BUFC_TYPE(db), db->db.db_size));
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Loan out an arc_buf for read. Return the loaned arc_buf.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
|
|
|
dbuf_loan_arcbuf(dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
arc_buf_t *abuf;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&db->db_mtx);
|
2018-10-01 20:42:05 +03:00
|
|
|
if (arc_released(db->db_buf) || zfs_refcount_count(&db->db_holds) > 1) {
|
2010-05-29 00:45:14 +04:00
|
|
|
int blksz = db->db.db_size;
|
2013-12-09 22:37:51 +04:00
|
|
|
spa_t *spa = db->db_objset->os_spa;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&db->db_mtx);
|
2016-07-11 20:45:52 +03:00
|
|
|
abuf = arc_loan_buf(spa, B_FALSE, blksz);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(abuf->b_data, db->db.db_data, blksz);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
abuf = db->db_buf;
|
|
|
|
arc_loan_inuse_buf(abuf, db);
|
2016-06-02 07:04:53 +03:00
|
|
|
db->db_buf = NULL;
|
2015-04-02 06:44:32 +03:00
|
|
|
dbuf_clear_data(db);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
}
|
|
|
|
return (abuf);
|
|
|
|
}
|
|
|
|
|
2015-12-22 04:31:57 +03:00
|
|
|
/*
|
|
|
|
* Calculate which level n block references the data at the level 0 offset
|
|
|
|
* provided.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t
|
2016-08-31 11:12:08 +03:00
|
|
|
dbuf_whichblock(const dnode_t *dn, const int64_t level, const uint64_t offset)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-12-22 04:31:57 +03:00
|
|
|
if (dn->dn_datablkshift != 0 && dn->dn_indblkshift != 0) {
|
|
|
|
/*
|
|
|
|
* The level n blkid is equal to the level 0 blkid divided by
|
|
|
|
* the number of level 0s in a level n block.
|
|
|
|
*
|
|
|
|
* The level 0 blkid is offset >> datablkshift =
|
|
|
|
* offset / 2^datablkshift.
|
|
|
|
*
|
|
|
|
* The number of level 0s in a level n is the number of block
|
|
|
|
* pointers in an indirect block, raised to the power of level.
|
|
|
|
* This is 2^(indblkshift - SPA_BLKPTRSHIFT)^level =
|
|
|
|
* 2^(level*(indblkshift - SPA_BLKPTRSHIFT)).
|
|
|
|
*
|
|
|
|
* Thus, the level n blkid is: offset /
|
2018-08-13 23:33:47 +03:00
|
|
|
* ((2^datablkshift)*(2^(level*(indblkshift-SPA_BLKPTRSHIFT))))
|
2015-12-22 04:31:57 +03:00
|
|
|
* = offset / 2^(datablkshift + level *
|
|
|
|
* (indblkshift - SPA_BLKPTRSHIFT))
|
|
|
|
* = offset >> (datablkshift + level *
|
|
|
|
* (indblkshift - SPA_BLKPTRSHIFT))
|
|
|
|
*/
|
2016-08-31 11:12:08 +03:00
|
|
|
|
|
|
|
const unsigned exp = dn->dn_datablkshift +
|
|
|
|
level * (dn->dn_indblkshift - SPA_BLKPTRSHIFT);
|
|
|
|
|
|
|
|
if (exp >= 8 * sizeof (offset)) {
|
|
|
|
/* This only happens on the highest indirection level */
|
|
|
|
ASSERT3U(level, ==, dn->dn_nlevels - 1);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3U(exp, <, 8 * sizeof (offset));
|
|
|
|
|
|
|
|
return (offset >> exp);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3U(offset, <, dn->dn_datablksz);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
}
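/*
 * Illustrative sketch, not part of dbuf.c: the shift derivation documented
 * in dbuf_whichblock() above, as a user-space function. The name
 * sketch_whichblock() and the macro SKETCH_BLKPTRSHIFT are hypothetical;
 * the value 7 assumes a 128-byte block pointer. The dnode fields are
 * passed in as plain shift values, and the single-block case
 * (datablkshift == 0) is excluded by an assertion.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_BLKPTRSHIFT	7	/* assumption: 128-byte block pointers */

/* Level-n block id that covers the given level-0 byte offset. */
static uint64_t
sketch_whichblock(unsigned datablkshift, unsigned indblkshift,
    unsigned level, uint64_t offset)
{
	assert(datablkshift != 0 && indblkshift > SKETCH_BLKPTRSHIFT);

	unsigned exp = datablkshift +
	    level * (indblkshift - SKETCH_BLKPTRSHIFT);

	/*
	 * Shifting a 64-bit value by 64 or more is undefined in C, so the
	 * top-level case is handled explicitly: everything maps to block 0.
	 */
	if (exp >= 8 * sizeof (offset))
		return (0);
	return (offset >> exp);
}
```

/*
 * With 128K data and indirect blocks (both shifts 17), one level-1 block
 * covers 2^(17 + 10) bytes, so offset 2^27 is the first byte of level-1
 * block 1.
 */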
|
|
|
|
|
2019-07-08 23:18:50 +03:00
|
|
|
/*
|
|
|
|
* This function is used to lock the parent of the provided dbuf. This should be
|
|
|
|
* used when modifying or reading db_blkptr.
|
|
|
|
*/
|
|
|
|
db_lock_type_t
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_buf_lock_parent(dmu_buf_impl_t *db, krw_t rw, const void *tag)
|
2019-07-08 23:18:50 +03:00
|
|
|
{
|
|
|
|
enum db_lock_type ret = DLT_NONE;
|
|
|
|
if (db->db_parent != NULL) {
|
|
|
|
rw_enter(&db->db_parent->db_rwlock, rw);
|
|
|
|
ret = DLT_PARENT;
|
|
|
|
} else if (dmu_objset_ds(db->db_objset) != NULL) {
|
|
|
|
rrw_enter(&dmu_objset_ds(db->db_objset)->ds_bp_rwlock, rw,
|
|
|
|
tag);
|
|
|
|
ret = DLT_OBJSET;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We only return a DLT_NONE lock when it's the top-most indirect block
|
|
|
|
* of the meta-dnode of the MOS.
|
|
|
|
*/
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to pass the lock type in because it's possible that the block will
|
|
|
|
* move from being the topmost indirect block in a dnode (and thus, have no
|
|
|
|
* parent) to not the top-most via an indirection increase. This would cause a
|
|
|
|
* panic if we didn't pass the lock type in.
|
|
|
|
*/
|
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_buf_unlock_parent(dmu_buf_impl_t *db, db_lock_type_t type, const void *tag)
|
2019-07-08 23:18:50 +03:00
|
|
|
{
|
|
|
|
if (type == DLT_PARENT)
|
|
|
|
rw_exit(&db->db_parent->db_rwlock);
|
|
|
|
else if (type == DLT_OBJSET)
|
|
|
|
rrw_exit(&dmu_objset_ds(db->db_objset)->ds_bp_rwlock, tag);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2017-11-16 04:27:01 +03:00
|
|
|
dbuf_read_done(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
|
|
|
|
arc_buf_t *buf, void *vdb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zb, (void) bp;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *db = vdb;
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
ASSERT3U(db->db_state, ==, DB_READ);
|
|
|
|
/*
|
|
|
|
* All reads are synchronous, so we must have a hold on the dbuf
|
|
|
|
*/
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_count(&db->db_holds) > 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_buf == NULL);
|
|
|
|
ASSERT(db->db.db_data == NULL);
|
2018-08-29 21:33:33 +03:00
|
|
|
if (buf == NULL) {
|
|
|
|
/* i/o error */
|
|
|
|
ASSERT(zio == NULL || zio->io_error != 0);
|
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
|
|
|
ASSERT3P(db->db_buf, ==, NULL);
|
|
|
|
db->db_state = DB_UNCACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "i/o error");
|
2018-08-29 21:33:33 +03:00
|
|
|
} else if (db->db_level == 0 && db->db_freed_in_flight) {
|
|
|
|
/* freed in flight */
|
|
|
|
ASSERT(zio == NULL || zio->io_error == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_release(buf, db);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(buf->b_data, 0, db->db.db_size);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_freeze(buf);
|
|
|
|
db->db_freed_in_flight = FALSE;
|
|
|
|
dbuf_set_data(db, buf);
|
|
|
|
db->db_state = DB_CACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "freed in flight");
|
2018-08-29 21:33:33 +03:00
|
|
|
} else {
|
|
|
|
/* success */
|
|
|
|
ASSERT(zio == NULL || zio->io_error == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_set_data(db, buf);
|
|
|
|
db->db_state = DB_CACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "successful read");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
cv_broadcast(&db->db_changed);
|
2018-08-01 00:51:15 +03:00
|
|
|
dbuf_rele_and_unlock(db, NULL, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-02-18 22:21:37 +03:00
|
|
|
/*
|
|
|
|
* Shortcut for performing reads on bonus dbufs. Returns
|
|
|
|
* an error if we fail to verify the dnode associated with
|
|
|
|
* a decrypted block. Otherwise success.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
dbuf_read_bonus(dmu_buf_impl_t *db, dnode_t *dn, uint32_t flags)
|
|
|
|
{
|
|
|
|
int bonuslen, max_bonuslen, err;
|
|
|
|
|
|
|
|
err = dbuf_read_verify_dnode_crypt(db, flags);
|
|
|
|
if (err)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);
|
|
|
|
max_bonuslen = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots);
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
ASSERT(DB_DNODE_HELD(db));
|
|
|
|
ASSERT3U(bonuslen, <=, db->db.db_size);
|
|
|
|
db->db.db_data = kmem_alloc(max_bonuslen, KM_SLEEP);
|
|
|
|
arc_space_consume(max_bonuslen, ARC_SPACE_BONUS);
|
|
|
|
if (bonuslen < max_bonuslen)
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db->db.db_data, 0, max_bonuslen);
|
2020-02-18 22:21:37 +03:00
|
|
|
if (bonuslen)
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(db->db.db_data, DN_BONUS(dn->dn_phys), bonuslen);
|
2020-02-18 22:21:37 +03:00
|
|
|
db->db_state = DB_CACHED;
|
|
|
|
DTRACE_SET_STATE(db, "bonus buffer filled");
|
|
|
|
return (0);
|
|
|
|
}
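/*
 * Illustrative sketch, not part of dbuf.c: the fill pattern used by
 * dbuf_read_bonus() above. The buffer is always allocated at the maximum
 * bonus length, zeroed only when the valid bonus data is shorter than
 * that, and then the valid prefix is copied in. sketch_fill_bonus() is a
 * hypothetical name, and malloc() stands in for kmem_alloc(..., KM_SLEEP),
 * so unlike the kernel path it can return NULL.
 */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a max_bonuslen-byte buffer; the caller frees it with free(). */
static void *
sketch_fill_bonus(const void *src, size_t bonuslen, size_t max_bonuslen)
{
	assert(bonuslen <= max_bonuslen);

	unsigned char *dst = malloc(max_bonuslen);
	if (dst == NULL)
		return (NULL);

	/* Zeroing is skipped when the copy overwrites every byte anyway. */
	if (bonuslen < max_bonuslen)
		memset(dst, 0, max_bonuslen);
	if (bonuslen != 0)
		memcpy(dst, src, bonuslen);
	return (dst);
}
```

/*
 * Zeroing the tail keeps stale heap contents from showing up in the
 * unused part of the bonus buffer.
 */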
|
|
|
|
|
|
|
|
static void
|
2023-03-10 22:59:53 +03:00
|
|
|
dbuf_handle_indirect_hole(dmu_buf_impl_t *db, dnode_t *dn, blkptr_t *dbbp)
|
2020-02-18 22:21:37 +03:00
|
|
|
{
|
|
|
|
blkptr_t *bps = db->db.db_data;
|
|
|
|
uint32_t indbs = 1ULL << dn->dn_indblkshift;
|
|
|
|
int n_bps = indbs >> SPA_BLKPTRSHIFT;
|
|
|
|
|
|
|
|
for (int i = 0; i < n_bps; i++) {
|
|
|
|
blkptr_t *bp = &bps[i];
|
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
ASSERT3U(BP_GET_LSIZE(dbbp), ==, indbs);
|
|
|
|
BP_SET_LSIZE(bp, BP_GET_LEVEL(dbbp) == 1 ?
|
|
|
|
dn->dn_datablksz : BP_GET_LSIZE(dbbp));
|
|
|
|
BP_SET_TYPE(bp, BP_GET_TYPE(dbbp));
|
|
|
|
BP_SET_LEVEL(bp, BP_GET_LEVEL(dbbp) - 1);
|
|
|
|
BP_SET_BIRTH(bp, dbbp->blk_birth, 0);
|
2020-02-18 22:21:37 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle reads on dbufs that are holes, if necessary. This function
|
|
|
|
* requires that the dbuf's mutex is held. Returns success (0) if action
|
|
|
|
* was taken, ENOENT if no action was taken.
|
|
|
|
*/
|
|
|
|
static int
|
2023-03-10 22:59:53 +03:00
|
|
|
dbuf_read_hole(dmu_buf_impl_t *db, dnode_t *dn, blkptr_t *bp)
|
2020-02-18 22:21:37 +03:00
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
int is_hole = bp == NULL || BP_IS_HOLE(bp);
|
2020-02-18 22:21:37 +03:00
|
|
|
/*
|
|
|
|
* For level 0 blocks only, if the above check fails:
|
|
|
|
* Recheck BP_IS_HOLE() after dnode_block_freed() in case dnode_sync()
|
|
|
|
* processes the delete record and clears the bp while we are waiting
|
|
|
|
* for the dn_mtx (resulting in a "no" from block_freed).
|
|
|
|
*/
|
2023-03-10 22:59:53 +03:00
|
|
|
if (!is_hole && db->db_level == 0)
|
|
|
|
is_hole = dnode_block_freed(dn, db->db_blkid) || BP_IS_HOLE(bp);
|
2020-02-18 22:21:37 +03:00
|
|
|
|
|
|
|
if (is_hole) {
|
|
|
|
dbuf_set_data(db, dbuf_alloc_arcbuf(db));
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db->db.db_data, 0, db->db.db_size);
|
2020-02-18 22:21:37 +03:00
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
if (bp != NULL && db->db_level > 0 && BP_IS_HOLE(bp) &&
|
|
|
|
bp->blk_birth != 0) {
|
|
|
|
dbuf_handle_indirect_hole(db, dn, bp);
|
2020-02-18 22:21:37 +03:00
|
|
|
}
|
|
|
|
db->db_state = DB_CACHED;
|
|
|
|
DTRACE_SET_STATE(db, "hole read satisfied");
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
2018-06-28 19:20:34 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When doing a decrypting read of a block, this function ensures that
|
|
|
|
* we have decrypted the dnode associated with it. We must do
|
|
|
|
* this so that we ensure we are fully authenticating the checksum-of-MACs
|
|
|
|
* tree from the root of the objset down to this block. Indirect blocks are
|
|
|
|
* always verified against their secure checksum-of-MACs assuming that the
|
|
|
|
* dnode containing them is correct. Now that we are doing a decrypting read,
|
|
|
|
* we can be sure that the key is loaded and verify that assumption. This is
|
|
|
|
* especially important considering that we always read encrypted dnode
|
|
|
|
* blocks as raw data (without verifying their MACs) to start, and
|
|
|
|
* decrypt / authenticate them when we need to read an encrypted bonus buffer.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
dbuf_read_verify_dnode_crypt(dmu_buf_impl_t *db, uint32_t flags)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
objset_t *os = db->db_objset;
|
|
|
|
arc_buf_t *dnode_abuf;
|
|
|
|
dnode_t *dn;
|
|
|
|
zbookmark_phys_t zb;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
2022-11-29 20:49:02 +03:00
|
|
|
if ((flags & DB_RF_NO_DECRYPT) != 0 ||
|
|
|
|
!os->os_encrypted || os->os_raw_receive)
|
2018-06-28 19:20:34 +03:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
|
|
|
dnode_abuf = (dn->dn_dbuf != NULL) ? dn->dn_dbuf->db_buf : NULL;
|
|
|
|
|
|
|
|
if (dnode_abuf == NULL || !arc_is_encrypted(dnode_abuf)) {
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
SET_BOOKMARK(&zb, dmu_objset_id(os),
|
|
|
|
DMU_META_DNODE_OBJECT, 0, dn->dn_dbuf->db_blkid);
|
|
|
|
err = arc_untransform(dnode_abuf, os->os_spa, &zb, B_TRUE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* An error code of EACCES tells us that the key is still not
|
|
|
|
* available. This is ok if we are only reading authenticated
|
|
|
|
* (and therefore non-encrypted) blocks.
|
|
|
|
*/
|
|
|
|
if (err == EACCES && ((db->db_blkid != DMU_BONUS_BLKID &&
|
|
|
|
!DMU_OT_IS_ENCRYPTED(dn->dn_type)) ||
|
|
|
|
(db->db_blkid == DMU_BONUS_BLKID &&
|
|
|
|
!DMU_OT_IS_ENCRYPTED(dn->dn_bonustype))))
|
|
|
|
err = 0;
|
|
|
|
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
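/*
 * Illustrative sketch, not part of this file: the EACCES handling in
 * dbuf_read_verify_dnode_crypt() can be modeled as a pure function.
 * struct read_ctx and filter_untransform_error() are hypothetical names
 * invented for this example; the three booleans are assumed stand-ins for
 * the db_blkid and DMU_OT_IS_ENCRYPTED() checks in the function above.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields the real check reads. */
struct read_ctx {
	bool is_bonus;            /* models db->db_blkid == DMU_BONUS_BLKID */
	bool type_encrypted;      /* models DMU_OT_IS_ENCRYPTED(dn->dn_type) */
	bool bonustype_encrypted; /* models DMU_OT_IS_ENCRYPTED(dn->dn_bonustype) */
};

/*
 * Map the error from the untransform step to the value the caller sees.
 * EACCES (key not loaded) is forgiven only when the buffer being read is
 * not itself encrypted; every other error is passed through unchanged.
 */
static int
filter_untransform_error(const struct read_ctx *rc, int err)
{
	assert(rc != NULL);

	/* A bonus buffer is judged by the bonus type, anything else by dn_type. */
	bool target_encrypted = rc->is_bonus ?
	    rc->bonustype_encrypted : rc->type_encrypted;

	if (err == EACCES && !target_encrypted)
		return (0);
	return (err);
}
```

/*
 * In this model a data block of a non-encrypted object type still reads
 * successfully while the key is unloaded, whereas an encrypted bonus
 * buffer keeps reporting EACCES.
 */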
|
|
|
|
|
2019-07-08 23:18:50 +03:00
|
|
|
/*
|
|
|
|
* Drops db_mtx and the parent lock specified by dblt and tag before
|
|
|
|
* returning.
|
|
|
|
*/
|
2014-09-10 22:59:03 +04:00
|
|
|
static int
|
2024-04-04 01:04:26 +03:00
|
|
|
dbuf_read_impl(dmu_buf_impl_t *db, dnode_t *dn, zio_t *zio, uint32_t flags,
|
2022-04-19 21:38:30 +03:00
|
|
|
db_lock_type_t dblt, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t zb;
|
2014-12-06 20:24:32 +03:00
|
|
|
uint32_t aflags = ARC_FLAG_NOWAIT;
|
2020-02-18 22:21:37 +03:00
|
|
|
int err, zio_flags;
|
2024-04-08 22:03:18 +03:00
|
|
|
blkptr_t bp, *bpp = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2023-03-10 22:59:53 +03:00
|
|
|
ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_buf == NULL);
|
2019-07-08 23:18:50 +03:00
|
|
|
ASSERT(db->db_parent == NULL ||
|
|
|
|
RW_LOCK_HELD(&db->db_parent->db_rwlock));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
2020-02-18 22:21:37 +03:00
|
|
|
err = dbuf_read_bonus(db, dn, flags);
|
|
|
|
goto early_unlock;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2024-04-08 22:03:18 +03:00
|
|
|
/*
|
|
|
|
* If we have a pending block clone, we don't want to read the
|
|
|
|
* underlying block, but the content of the block being cloned,
|
|
|
|
* pointed to by the dirty record, so we have the most recent data.
|
|
|
|
* If there is no dirty record, then we hit a race in a sync
|
|
|
|
* process when the dirty record is already removed, while the
|
|
|
|
* dbuf is not yet destroyed. Such a case is equivalent to uncached.
|
|
|
|
*/
|
|
|
|
if (db->db_state == DB_NOFILL) {
|
|
|
|
dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
|
|
|
|
if (dr != NULL) {
|
|
|
|
if (!dr->dt.dl.dr_brtwrite) {
|
|
|
|
err = SET_ERROR(EIO);
|
|
|
|
goto early_unlock;
|
|
|
|
}
|
|
|
|
bp = dr->dt.dl.dr_overridden_by;
|
2023-03-10 22:59:53 +03:00
|
|
|
bpp = &bp;
|
|
|
|
}
|
2024-04-08 22:03:18 +03:00
|
|
|
}
|
2023-03-10 22:59:53 +03:00
|
|
|
|
2024-04-08 22:03:18 +03:00
|
|
|
if (bpp == NULL && db->db_blkptr != NULL) {
|
|
|
|
bp = *db->db_blkptr;
|
2023-04-30 12:47:09 +03:00
|
|
|
bpp = &bp;
|
2023-03-10 22:59:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
err = dbuf_read_hole(db, dn, bpp);
|
2020-02-18 22:21:37 +03:00
|
|
|
if (err == 0)
|
|
|
|
goto early_unlock;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
ASSERT(bpp != NULL);
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
/*
|
|
|
|
* Any attempt to read a redacted block should result in an error. This
|
|
|
|
* will never happen under normal conditions, but can be useful for
|
|
|
|
* debugging purposes.
|
|
|
|
*/
|
2023-03-10 22:59:53 +03:00
|
|
|
if (BP_IS_REDACTED(bpp)) {
|
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(dsl_dataset_feature_is_active(
|
|
|
|
db->db_objset->os_dsl_dataset,
|
|
|
|
SPA_FEATURE_REDACTED_DATASETS));
|
2020-02-18 22:21:37 +03:00
|
|
|
err = SET_ERROR(EIO);
|
|
|
|
goto early_unlock;
|
2019-06-19 19:48:13 +03:00
|
|
|
}
|
|
|
|
|
2018-07-02 23:37:48 +03:00
|
|
|
SET_BOOKMARK(&zb, dmu_objset_id(db->db_objset),
|
|
|
|
db->db.db_object, db->db_level, db->db_blkid);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* All bps of an encrypted os should have the encryption bit set.
|
|
|
|
* If this is not true it indicates tampering and we report an error.
|
|
|
|
*/
|
2023-03-10 22:59:53 +03:00
|
|
|
if (db->db_objset->os_encrypted && !BP_USES_CRYPT(bpp)) {
|
2023-05-02 19:24:26 +03:00
|
|
|
spa_log_error(db->db_objset->os_spa, &zb, &bpp->blk_birth);
|
2020-02-18 22:21:37 +03:00
|
|
|
err = SET_ERROR(EIO);
|
|
|
|
goto early_unlock;
|
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
2018-06-28 19:20:34 +03:00
|
|
|
err = dbuf_read_verify_dnode_crypt(db, flags);
|
2020-02-18 22:21:37 +03:00
|
|
|
if (err != 0)
|
|
|
|
goto early_unlock;
|
2018-06-28 19:20:34 +03:00
|
|
|
|
|
|
|
db->db_state = DB_READ;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "read issued");
|
2018-06-28 19:20:34 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
read reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (!DBUF_IS_CACHEABLE(db))
|
|
|
|
aflags |= ARC_FLAG_UNCACHED;
|
|
|
|
else if (dbuf_is_l2cacheable(db))
|
2018-06-28 19:20:34 +03:00
|
|
|
aflags |= ARC_FLAG_L2CACHE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_add_ref(db, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
zio_flags = (flags & DB_RF_CANFAIL) ?
|
|
|
|
ZIO_FLAG_CANFAIL : ZIO_FLAG_MUSTSUCCEED;
|
|
|
|
|
|
|
|
if ((flags & DB_RF_NO_DECRYPT) && BP_IS_PROTECTED(db->db_blkptr))
|
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
2019-07-08 23:18:50 +03:00
|
|
|
/*
|
2023-03-10 22:59:53 +03:00
|
|
|
* The zio layer will copy the provided blkptr later, but we have our
|
|
|
|
* own copy so that we can release the parent's rwlock. We have to
|
|
|
|
* do that so that if dbuf_read_done is called synchronously (on
|
2019-07-08 23:18:50 +03:00
|
|
|
* an l1 cache hit) we don't acquire the db_mtx while holding the
|
|
|
|
* parent's rwlock, which would be a lock ordering violation.
|
|
|
|
*/
|
|
|
|
dmu_buf_unlock_parent(db, dblt, tag);
|
2024-04-04 01:04:26 +03:00
|
|
|
return (arc_read(zio, db->db_objset->os_spa, bpp,
|
2017-08-14 20:36:48 +03:00
|
|
|
dbuf_read_done, db, ZIO_PRIORITY_SYNC_READ, zio_flags,
|
2024-04-04 01:04:26 +03:00
|
|
|
&aflags, &zb));
|
|
|
|
|
2020-02-18 22:21:37 +03:00
|
|
|
early_unlock:
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
dmu_buf_unlock_parent(db, dblt, tag);
|
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
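/*
 * Illustrative sketch, not part of this file: the way dbuf_read_impl()
 * chooses which block pointer to read can be modeled on its own. All
 * toy_* types and toy_select_bp() are hypothetical simplifications
 * invented for this example; they are assumed to mirror only the
 * DB_NOFILL / dr_brtwrite / db_blkptr checks above, with no locking.
 */

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the real ZFS types. */
typedef struct {
	unsigned long long birth;
} toy_bp_t;

typedef struct {
	bool brtwrite;          /* models dr->dt.dl.dr_brtwrite */
	toy_bp_t overridden_by; /* models dr->dt.dl.dr_overridden_by */
} toy_dirty_t;

typedef struct {
	bool nofill;            /* models db->db_state == DB_NOFILL */
	const toy_dirty_t *dr;  /* newest dirty record, or NULL */
	const toy_bp_t *blkptr; /* models db->db_blkptr, may be NULL */
} toy_dbuf_t;

/*
 * Copy the block pointer to read from into *out and set *found.
 * Returns EIO for a NOFILL buffer whose dirty record is not a block clone,
 * and 0 otherwise. *found stays false when there is nothing on disk, which
 * corresponds to the hole case handled by dbuf_read_hole().
 */
static int
toy_select_bp(const toy_dbuf_t *db, toy_bp_t *out, bool *found)
{
	*found = false;

	/* A pending clone takes priority over the on-disk pointer. */
	if (db->nofill && db->dr != NULL) {
		if (!db->dr->brtwrite)
			return (EIO);
		*out = db->dr->overridden_by;
		*found = true;
	}

	/* No clone (or the dirty record already went away): use db_blkptr. */
	if (!*found && db->blkptr != NULL) {
		*out = *db->blkptr;
		*found = true;
	}
	return (0);
}
```

/*
 * Copying the pointer by value, as the sketch does, is what lets the real
 * function drop the parent's rwlock before issuing the read.
 */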
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* This is our just-in-time copy function. It makes a copy of buffers that
|
|
|
|
* have been modified in a previous transaction group before we access them in
|
|
|
|
* the current active group.
|
|
|
|
*
|
|
|
|
* This function is used in three places: when we are dirtying a buffer for the
|
|
|
|
* first time in a txg, when we are freeing a range in a dnode that includes
|
|
|
|
* this buffer, and when we are accessing a buffer which was received compressed
|
|
|
|
* and later referenced in a WRITE_BYREF record.
|
|
|
|
*
|
|
|
|
* Note that when we are called from dbuf_free_range() we do not put a hold on
|
|
|
|
* the buffer, we just traverse the active dbuf list for the dnode.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_fix_old_data(dmu_buf_impl_t *db, uint64_t txg)
|
|
|
|
{
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
ASSERT(db->db.db_data != NULL);
|
|
|
|
ASSERT(db->db_level == 0);
|
|
|
|
ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);
|
|
|
|
|
|
|
|
if (dr == NULL ||
|
|
|
|
(dr->dt.dl.dr_data !=
|
|
|
|
((db->db_blkid == DMU_BONUS_BLKID) ? db->db.db_data : db->db_buf)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the last dirty record for this dbuf has not yet synced
|
|
|
|
* and it's referencing the dbuf data, either:
|
|
|
|
* reset the reference to point to a new copy,
|
|
|
|
* or (if there are no active holders)
|
|
|
|
* just null out the current db_data pointer.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
ASSERT3U(dr->dr_txg, >=, txg - 2);
|
2016-07-11 20:45:52 +03:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
|
|
|
dnode_t *dn = DB_DNODE(db);
|
|
|
|
int bonuslen = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots);
|
2016-12-01 02:18:20 +03:00
|
|
|
dr->dt.dl.dr_data = kmem_alloc(bonuslen, KM_SLEEP);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_space_consume(bonuslen, ARC_SPACE_BONUS);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(dr->dt.dl.dr_data, db->db.db_data, bonuslen);
|
2018-10-01 20:42:05 +03:00
|
|
|
} else if (zfs_refcount_count(&db->db_holds) > db->db_dirtycnt) {
|
2021-06-23 07:39:15 +03:00
|
|
|
dnode_t *dn = DB_DNODE(db);
|
|
|
|
int size = arc_buf_size(db->db_buf);
|
|
|
|
arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
|
|
|
|
spa_t *spa = db->db_objset->os_spa;
|
|
|
|
enum zio_compress compress_type =
|
|
|
|
arc_get_compression(db->db_buf);
|
|
|
|
uint8_t complevel = arc_get_complevel(db->db_buf);
|
|
|
|
|
|
|
|
if (arc_is_encrypted(db->db_buf)) {
|
|
|
|
boolean_t byteorder;
|
|
|
|
uint8_t salt[ZIO_DATA_SALT_LEN];
|
|
|
|
uint8_t iv[ZIO_DATA_IV_LEN];
|
|
|
|
uint8_t mac[ZIO_DATA_MAC_LEN];
|
|
|
|
|
|
|
|
arc_get_raw_params(db->db_buf, &byteorder, salt,
|
|
|
|
iv, mac);
|
|
|
|
dr->dt.dl.dr_data = arc_alloc_raw_buf(spa, db,
|
|
|
|
dmu_objset_id(dn->dn_objset), byteorder, salt, iv,
|
|
|
|
mac, dn->dn_type, size, arc_buf_lsize(db->db_buf),
|
|
|
|
compress_type, complevel);
|
|
|
|
} else if (compress_type != ZIO_COMPRESS_OFF) {
|
|
|
|
ASSERT3U(type, ==, ARC_BUFC_DATA);
|
|
|
|
dr->dt.dl.dr_data = arc_alloc_compressed_buf(spa, db,
|
|
|
|
size, arc_buf_lsize(db->db_buf), compress_type,
|
|
|
|
complevel);
|
|
|
|
} else {
|
|
|
|
dr->dt.dl.dr_data = arc_alloc_buf(spa, db, type, size);
|
|
|
|
}
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(dr->dt.dl.dr_data->b_data, db->db.db_data, size);
|
2016-07-11 20:45:52 +03:00
|
|
|
} else {
|
|
|
|
db->db_buf = NULL;
|
|
|
|
dbuf_clear_data(db);
|
|
|
|
}
|
|
|
|
}
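/*
 * Illustrative sketch, not part of this file: once dbuf_fix_old_data()
 * has found a dirty record that still references the live data, its
 * remaining work is a three-way choice. fix_action_t and
 * toy_fix_old_data_action() are hypothetical names invented for this
 * example; the arguments are assumed stand-ins for the DMU_BONUS_BLKID
 * test, zfs_refcount_count(&db->db_holds), and db->db_dirtycnt.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three branches of the real function. */
typedef enum {
	FIX_COPY_BONUS,   /* kmem_alloc() a private copy of the bonus buffer */
	FIX_COPY_ARC_BUF, /* allocate a new ARC buffer and memcpy() into it */
	FIX_DETACH        /* hand the buffer to the dirty record, clear db_buf */
} fix_action_t;

/*
 * Decide how the old dirty record gets a stable copy of the data.
 * A copy is needed only if someone other than the dirty records still
 * holds the dbuf, i.e. holds > dirtycnt; otherwise the dbuf can simply
 * give up its reference to the data.
 */
static fix_action_t
toy_fix_old_data_action(bool is_bonus, int64_t holds, int64_t dirtycnt)
{
	if (is_bonus)
		return (FIX_COPY_BONUS);
	if (holds > dirtycnt)
		return (FIX_COPY_ARC_BUF);
	return (FIX_DETACH);
}
```

/*
 * The sketch leaves out the encrypted and compressed allocation variants;
 * in the real function those only change which ARC allocator is called
 * for the copy, not which branch is taken.
 */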
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2024-04-04 01:04:26 +03:00
|
|
|
dbuf_read(dmu_buf_impl_t *db, zio_t *pio, uint32_t flags)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int err = 0;
|
2013-12-09 22:37:51 +04:00
|
|
|
boolean_t prefetch;
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_t *dn;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't have to hold the mutex to check db_state because it
|
|
|
|
* can't be freed while we have a hold on the buffer.
|
|
|
|
*/
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
prefetch = db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
|
2024-04-04 01:04:26 +03:00
|
|
|
(flags & DB_RF_NOPREFETCH) == 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
2023-01-05 03:29:54 +03:00
|
|
|
if (flags & DB_RF_PARTIAL_FIRST)
|
|
|
|
db->db_partial_read = B_TRUE;
|
|
|
|
else if (!(flags & DB_RF_PARTIAL_MORE))
|
|
|
|
db->db_partial_read = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_state == DB_CACHED) {
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
2018-06-28 19:20:34 +03:00
|
|
|
* Ensure that this block's dnode has been decrypted if
|
|
|
|
* the caller has requested decrypted data.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2018-06-28 19:20:34 +03:00
|
|
|
err = dbuf_read_verify_dnode_crypt(db, flags);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the arc buf is compressed or encrypted and the caller
|
|
|
|
* requested uncompressed data, we need to untransform it
|
|
|
|
* before returning. We also call arc_untransform() on any
|
|
|
|
* unauthenticated blocks, which will verify their MAC if
|
|
|
|
* the key is now available.
|
|
|
|
*/
|
|
|
|
if (err == 0 && db->db_buf != NULL &&
|
|
|
|
(flags & DB_RF_NO_DECRYPT) == 0 &&
|
2017-08-14 20:36:48 +03:00
|
|
|
(arc_is_encrypted(db->db_buf) ||
|
2018-06-28 19:20:34 +03:00
|
|
|
arc_is_unauthenticated(db->db_buf) ||
|
2017-08-14 20:36:48 +03:00
|
|
|
arc_get_compression(db->db_buf) != ZIO_COMPRESS_OFF)) {
|
2022-11-29 20:49:02 +03:00
|
|
|
spa_t *spa = dn->dn_objset->os_spa;
|
2018-03-31 21:12:51 +03:00
|
|
|
zbookmark_phys_t zb;
|
|
|
|
|
|
|
|
SET_BOOKMARK(&zb, dmu_objset_id(db->db_objset),
|
|
|
|
db->db.db_object, db->db_level, db->db_blkid);
|
Native Encryption for ZFS on Linux
2017-08-14 20:36:48 +03:00
|
|
|
dbuf_fix_old_data(db, spa_syncing_txg(spa));
|
2018-03-31 21:12:51 +03:00
|
|
|
err = arc_untransform(db->db_buf, spa, &zb, B_FALSE);
|
2016-07-11 20:45:52 +03:00
|
|
|
dbuf_set_data(db, db->db_buf);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2019-07-08 23:18:50 +03:00
|
|
|
if (err == 0 && prefetch) {
|
|
|
|
dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE,
|
Split dmu_zfetch() speculation and execution parts
To make better predictions on parallel workloads dmu_zfetch() should
be called as early as possible to reduce possible request reordering.
In particular, it should be called before dmu_buf_hold_array_by_dnode()
calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
up multiple threads at the same time on completion, which can significantly
reorder the requests, making the stream look random. But we
should not issue prefetch requests before the on-demand ones, since
they may get to the disks first despite the I/O scheduler, increasing
on-demand request latency.
This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
and dmu_zfetch_run(). The first can be executed as early as needed.
It only updates statistics and makes predictions without issuing any
I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be
called later when all on-demand I/Os are already issued. It even
tracks the activity of other concurrent threads, issuing the prefetch
only when _all_ on-demand requests are issued.
For many years this was a big problem for storage servers handling
deeper request queues from their clients: they had to either serialize
sequential reads to make the ZFS prefetcher usable, or execute the
incoming requests as-is and get almost no prefetch from ZFS, relying
only on deep enough prefetch by the clients. The benefits of each approach
varied, but neither was perfect. With this patch, deeper-queue
sequential read benchmarks with CrystalDiskMark from Windows via
iSCSI to a FreeBSD target show me much better throughput with an almost
100% prefetcher hit rate, compared to almost zero before.
While there, I also removed per-stream zs_lock as useless, completely
covered by parent zf_lock. Also I reused zs_blocks refcount to track
zf_stream linkage of the stream, since I believe previous zs_fetch ==
NULL check in dmu_zfetch_stream_done() was racy.
Delete prefetch streams when they reach ends of files. It saves up
to 1KB of RAM per file, plus reduces searches through the stream list.
Block data prefetch (speculation and indirect block prefetch is still
done since they are cheaper) if all dbufs of the stream are already
in DMU cache. First cache miss immediately fires all the prefetch
that would be done for the stream by that time. It saves some CPU
time if same files within DMU cache capacity are read over and over.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11652
2021-03-20 08:56:11 +03:00
|
|
|
B_FALSE, flags & DB_RF_HAVESTRUCT);
|
2019-07-08 23:18:50 +03:00
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMP(hash_hits);
|
2023-03-10 22:59:53 +03:00
|
|
|
} else if (db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL) {
|
2017-04-14 00:35:00 +03:00
|
|
|
boolean_t need_wait = B_FALSE;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2019-07-08 23:18:50 +03:00
|
|
|
db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_READER, FTAG);
|
|
|
|
|
2024-04-04 01:04:26 +03:00
|
|
|
if (pio == NULL && (db->db_state == DB_NOFILL ||
|
2023-03-10 22:59:53 +03:00
|
|
|
(db->db_blkptr != NULL && !BP_IS_HOLE(db->db_blkptr)))) {
|
2022-11-29 20:49:02 +03:00
|
|
|
spa_t *spa = dn->dn_objset->os_spa;
|
2024-04-04 01:04:26 +03:00
|
|
|
pio = zio_root(spa, NULL, NULL, ZIO_FLAG_CANFAIL);
|
2017-04-14 00:35:00 +03:00
|
|
|
need_wait = B_TRUE;
|
|
|
|
}
|
2024-04-04 01:04:26 +03:00
|
|
|
err = dbuf_read_impl(db, dn, pio, flags, dblt, FTAG);
|
2019-07-08 23:18:50 +03:00
|
|
|
/*
|
|
|
|
* dbuf_read_impl has dropped db_mtx and our parent's rwlock
|
|
|
|
* for us
|
|
|
|
*/
|
|
|
|
if (!err && prefetch) {
|
|
|
|
dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE,
|
Split dmu_zfetch() speculation and execution parts
2021-03-20 08:56:11 +03:00
|
|
|
db->db_state != DB_CACHED,
|
2019-07-08 23:18:50 +03:00
|
|
|
flags & DB_RF_HAVESTRUCT);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMP(hash_misses);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-15 23:23:40 +03:00
|
|
|
/*
|
|
|
|
* If we created a zio_root we must execute it to avoid
|
|
|
|
* leaking it, even if it isn't attached to any work due
|
|
|
|
* to an error in dbuf_read_impl().
|
|
|
|
*/
|
|
|
|
if (need_wait) {
|
|
|
|
if (err == 0)
|
2024-04-04 01:04:26 +03:00
|
|
|
err = zio_wait(pio);
|
2019-01-15 23:23:40 +03:00
|
|
|
else
|
2024-04-04 01:04:26 +03:00
|
|
|
(void) zio_wait(pio);
|
|
|
|
pio = NULL;
|
2019-01-15 23:23:40 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* Another reader came in while the dbuf was in flight
|
|
|
|
* between UNCACHED and CACHED. Either a writer will finish
|
|
|
|
* writing the buffer (sending the dbuf to CACHED) or the
|
|
|
|
* first reader's request will reach the read_done callback
|
|
|
|
* and send the dbuf to CACHED. Otherwise, a failure
|
|
|
|
* occurred and the dbuf went to UNCACHED.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2019-07-08 23:18:50 +03:00
|
|
|
if (prefetch) {
|
|
|
|
dmu_zfetch(&dn->dn_zfetch, db->db_blkid, 1, B_TRUE,
|
Split dmu_zfetch() speculation and execution parts
2021-03-20 08:56:11 +03:00
|
|
|
B_TRUE, flags & DB_RF_HAVESTRUCT);
|
2019-07-08 23:18:50 +03:00
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMP(hash_misses);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Skip the wait per the caller's request. */
|
2008-11-20 23:01:55 +03:00
|
|
|
if ((flags & DB_RF_NEVERWAIT) == 0) {
|
2020-02-27 03:09:17 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
while (db->db_state == DB_READ ||
|
|
|
|
db->db_state == DB_FILL) {
|
|
|
|
ASSERT(db->db_state == DB_READ ||
|
|
|
|
(flags & DB_RF_HAVESTRUCT) == 0);
|
2014-09-17 10:53:02 +04:00
|
|
|
DTRACE_PROBE2(blocked__read, dmu_buf_impl_t *,
|
2024-04-04 01:04:26 +03:00
|
|
|
db, zio_t *, pio);
|
2008-11-20 23:01:55 +03:00
|
|
|
cv_wait(&db->db_changed, &db->db_mtx);
|
|
|
|
}
|
|
|
|
if (db->db_state == DB_UNCACHED)
|
2013-03-08 22:41:28 +04:00
|
|
|
err = SET_ERROR(EIO);
|
2020-02-27 03:09:17 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-04-04 01:04:26 +03:00
|
|
|
if (pio && err != 0) {
|
|
|
|
zio_t *zio = zio_null(pio, pio->io_spa, NULL, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
zio->io_error = err;
|
|
|
|
zio_nowait(zio);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
}
|
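The final branch of dbuf_read() above waits on db_changed until the dbuf leaves DB_READ or DB_FILL, then reports EIO if the buffer ended up DB_UNCACHED. A minimal user-space sketch of that wait pattern follows, assuming POSIX threads in place of the kernel mutex and condition variable. Every sketch_* name and ST_* constant is hypothetical and is not part of ZFS.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the dbuf states referenced above. */
typedef enum { ST_UNCACHED, ST_READ, ST_FILL, ST_CACHED } sketch_state_t;

typedef struct {
	pthread_mutex_t mtx;	/* plays the role of db_mtx */
	pthread_cond_t changed;	/* plays the role of db_changed */
	sketch_state_t state;
} sketch_buf_t;

static void
sketch_init(sketch_buf_t *b, sketch_state_t st)
{
	pthread_mutex_init(&b->mtx, NULL);
	pthread_cond_init(&b->changed, NULL);
	b->state = st;
}

/* Move the buffer to a new state and wake every waiter. */
static void
sketch_set_state(sketch_buf_t *b, sketch_state_t st)
{
	pthread_mutex_lock(&b->mtx);
	b->state = st;
	pthread_cond_broadcast(&b->changed);
	pthread_mutex_unlock(&b->mtx);
}

/*
 * Sleep while a read or fill is in flight. Returns 0 once the buffer
 * has settled, or EIO if it fell back to the uncached state.
 */
static int
sketch_wait_settled(sketch_buf_t *b)
{
	int err = 0;

	pthread_mutex_lock(&b->mtx);
	while (b->state == ST_READ || b->state == ST_FILL)
		pthread_cond_wait(&b->changed, &b->mtx);
	if (b->state == ST_UNCACHED)
		err = EIO;
	pthread_mutex_unlock(&b->mtx);
	return (err);
}
```

The state is re-checked in a loop after every wakeup, which is why both the original code and this sketch use `while` rather than `if` around the wait.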
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_noread(dmu_buf_impl_t *db)
|
|
|
|
{
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
while (db->db_state == DB_READ || db->db_state == DB_FILL)
|
|
|
|
cv_wait(&db->db_changed, &db->db_mtx);
|
|
|
|
if (db->db_state == DB_UNCACHED) {
|
|
|
|
ASSERT(db->db_buf == NULL);
|
|
|
|
ASSERT(db->db.db_data == NULL);
|
2020-02-18 22:21:37 +03:00
|
|
|
dbuf_set_data(db, dbuf_alloc_arcbuf(db));
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_state = DB_FILL;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "assigning filled buffer");
|
2008-12-03 23:09:06 +03:00
|
|
|
} else if (db->db_state == DB_NOFILL) {
|
2015-04-02 06:44:32 +03:00
|
|
|
dbuf_clear_data(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3U(db->db_state, ==, DB_CACHED);
|
|
|
|
}
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dbuf_unoverride(dbuf_dirty_record_t *dr)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
2010-05-29 00:45:14 +04:00
|
|
|
blkptr_t *bp = &dr->dt.dl.dr_overridden_by;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg = dr->dr_txg;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2017-05-24 13:34:56 +03:00
|
|
|
/*
|
|
|
|
* This assert is valid because dmu_sync() expects to be called by
|
|
|
|
* a zilog's get_data while holding a range lock. This call only
|
|
|
|
* comes from dbuf_dirty() callers who must also hold a range lock.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(dr->dt.dl.dr_override_state != DR_IN_DMU_SYNC);
|
|
|
|
ASSERT(db->db_level == 0);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID ||
|
2008-11-20 23:01:55 +03:00
|
|
|
dr->dt.dl.dr_override_state == DR_NOT_OVERRIDDEN)
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_data_pending != dr);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* free this block */
|
2013-12-09 22:37:51 +04:00
|
|
|
if (!BP_IS_HOLE(bp) && !dr->dt.dl.dr_nopwrite)
|
|
|
|
zio_free(db->db_objset->os_spa, txg, bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2023-12-12 23:59:24 +03:00
|
|
|
if (dr->dt.dl.dr_brtwrite) {
|
|
|
|
ASSERT0(dr->dt.dl.dr_data);
|
|
|
|
dr->dt.dl.dr_data = db->db_buf;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
|
2013-05-10 23:47:54 +04:00
|
|
|
dr->dt.dl.dr_nopwrite = B_FALSE;
|
2023-04-30 12:47:09 +03:00
|
|
|
dr->dt.dl.dr_brtwrite = B_FALSE;
|
2018-04-17 21:06:54 +03:00
|
|
|
dr->dt.dl.dr_has_raw_params = B_FALSE;
|
2013-05-10 23:47:54 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Release the already-written buffer, so we leave it in
|
|
|
|
* a consistent dirty state. Note that all callers are
|
|
|
|
* modifying the buffer, so they will immediately do
|
|
|
|
* another (redundant) arc_release(). Therefore, leave
|
|
|
|
* the buf thawed to save the effort of freezing &
|
|
|
|
* immediately re-thawing it.
|
|
|
|
*/
|
2023-12-12 23:59:24 +03:00
|
|
|
if (dr->dt.dl.dr_data)
|
2023-03-10 22:59:53 +03:00
|
|
|
arc_release(dr->dt.dl.dr_data, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Evict (if it's unreferenced) or clear (if it's referenced) any level-0
|
|
|
|
* data blocks in the free range, so that any future readers will find
|
2013-12-09 22:37:51 +04:00
|
|
|
* empty blocks.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2015-04-03 06:14:28 +03:00
|
|
|
dbuf_free_range(dnode_t *dn, uint64_t start_blkid, uint64_t end_blkid,
|
|
|
|
dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_impl_t *db_search;
|
|
|
|
dmu_buf_impl_t *db, *db_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg = tx->tx_txg;
|
2015-04-03 06:14:28 +03:00
|
|
|
avl_index_t where;
|
2020-02-12 00:12:41 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2015-04-03 06:14:28 +03:00
|
|
|
|
2017-01-27 02:15:48 +03:00
|
|
|
if (end_blkid > dn->dn_maxblkid &&
|
|
|
|
!(start_blkid == DMU_SPILL_BLKID || end_blkid == DMU_SPILL_BLKID))
|
2015-04-03 06:14:28 +03:00
|
|
|
end_blkid = dn->dn_maxblkid;
|
2021-06-23 07:53:45 +03:00
|
|
|
dprintf_dnode(dn, "start=%llu end=%llu\n", (u_longlong_t)start_blkid,
|
|
|
|
(u_longlong_t)end_blkid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
db_search = kmem_alloc(sizeof (dmu_buf_impl_t), KM_SLEEP);
|
2015-04-03 06:14:28 +03:00
|
|
|
db_search->db_level = 0;
|
|
|
|
db_search->db_blkid = start_blkid;
|
2015-04-01 18:10:58 +03:00
|
|
|
db_search->db_state = DB_SEARCH;
|
2013-07-29 22:58:53 +04:00
|
|
|
|
2013-08-21 08:11:52 +04:00
|
|
|
mutex_enter(&dn->dn_dbufs_mtx);
|
2015-04-03 06:14:28 +03:00
|
|
|
db = avl_find(&dn->dn_dbufs, db_search, &where);
|
|
|
|
ASSERT3P(db, ==, NULL);
|
2017-01-27 02:15:48 +03:00
|
|
|
|
2015-04-03 06:14:28 +03:00
|
|
|
db = avl_nearest(&dn->dn_dbufs, where, AVL_AFTER);
|
|
|
|
|
|
|
|
for (; db != NULL; db = db_next) {
|
|
|
|
db_next = AVL_NEXT(&dn->dn_dbufs, db);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2015-04-03 06:14:28 +03:00
|
|
|
if (db->db_level != 0 || db->db_blkid > end_blkid) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ASSERT3U(db->db_blkid, >=, start_blkid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* found a level 0 buffer in the range */
|
2013-09-04 16:00:57 +04:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
if (dbuf_undirty(db, tx)) {
|
|
|
|
/* mutex has been dropped and dbuf destroyed */
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (db->db_state == DB_UNCACHED ||
|
2008-12-03 23:09:06 +03:00
|
|
|
db->db_state == DB_NOFILL ||
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_state == DB_EVICTING) {
|
|
|
|
ASSERT(db->db.db_data == NULL);
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (db->db_state == DB_READ || db->db_state == DB_FILL) {
|
|
|
|
/* will be handled in dbuf_read_done or dbuf_rele */
|
|
|
|
db->db_freed_in_flight = TRUE;
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
continue;
|
|
|
|
}
|
2018-10-01 20:42:05 +03:00
|
|
|
if (zfs_refcount_count(&db->db_holds) == 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_destroy(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* The dbuf is referenced */
|
|
|
|
|
2020-02-12 00:12:41 +03:00
|
|
|
dr = list_head(&db->db_dirty_records);
|
|
|
|
if (dr != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dr->dr_txg == txg) {
|
|
|
|
/*
|
|
|
|
* This buffer is "in-use", re-adjust the file
|
|
|
|
* size to reflect that this buffer may
|
|
|
|
* contain new data when we sync.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid != DMU_SPILL_BLKID &&
|
|
|
|
db->db_blkid > dn->dn_maxblkid)
|
2008-11-20 23:01:55 +03:00
|
|
|
dn->dn_maxblkid = db->db_blkid;
|
|
|
|
dbuf_unoverride(dr);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* This dbuf is not dirty in the open context.
|
|
|
|
* Either uncache it (if it's not referenced in
|
|
|
|
* the open context) or reset its contents to
|
|
|
|
* empty.
|
|
|
|
*/
|
|
|
|
dbuf_fix_old_data(db, txg);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/* clear the contents if it's cached */
|
|
|
|
if (db->db_state == DB_CACHED) {
|
|
|
|
ASSERT(db->db.db_data != NULL);
|
|
|
|
arc_release(db->db_buf, db);
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_enter(&db->db_rwlock, RW_WRITER);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db->db.db_data, 0, db->db.db_size);
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_exit(&db->db_rwlock);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_freeze(db->db_buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
}
|
2015-04-03 06:14:28 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&dn->dn_dbufs_mtx);
|
2021-11-30 21:32:38 +03:00
|
|
|
kmem_free(db_search, sizeof (dmu_buf_impl_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
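dbuf_free_range() above first clamps end_blkid to dn_maxblkid, except when either end of the range names the spill block. The clamp in isolation might look like the sketch below. SKETCH_SPILL_BLKID is a hypothetical stand-in for DMU_SPILL_BLKID (the real definition lives in the ZFS headers), and sketch_clamp_end() is an illustrative name, not a ZFS function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for DMU_SPILL_BLKID; a sentinel far above any real block ID. */
#define	SKETCH_SPILL_BLKID	(UINT64_MAX - 1)

/*
 * Return the end of the range to walk: capped at maxblkid for ordinary
 * ranges, left untouched when the range refers to the spill block.
 */
static uint64_t
sketch_clamp_end(uint64_t start, uint64_t end, uint64_t maxblkid)
{
	if (end > maxblkid &&
	    !(start == SKETCH_SPILL_BLKID || end == SKETCH_SPILL_BLKID))
		end = maxblkid;
	return (end);
}
```

Without the spill-block exception, a request to free only the spill block would be clamped down to an ordinary block ID and would miss its target.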
|
|
|
|
|
|
|
void
|
|
|
|
dbuf_new_size(dmu_buf_impl_t *db, int size, dmu_tx_t *tx)
|
|
|
|
{
|
2020-02-27 03:09:17 +03:00
|
|
|
arc_buf_t *buf, *old_buf;
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2008-11-20 23:01:55 +03:00
|
|
|
int osize = db->db.db_size;
|
|
|
|
arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_t *dn;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* XXX we should be doing a dbuf_read, checking the return
|
|
|
|
* value and returning that up to our callers
|
|
|
|
*/
|
2013-12-09 22:37:51 +04:00
|
|
|
dmu_buf_will_dirty(&db->db, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* create the data buffer for the new block */
|
2016-07-11 20:45:52 +03:00
|
|
|
buf = arc_alloc_buf(dn->dn_objset->os_spa, db, type, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* copy old block data to the new block */
|
2020-02-27 03:09:17 +03:00
|
|
|
old_buf = db->db_buf;
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(buf->b_data, old_buf->b_data, MIN(osize, size));
|
2008-11-20 23:01:55 +03:00
|
|
|
/* zero the remainder */
|
|
|
|
if (size > osize)
|
2022-02-25 16:26:54 +03:00
|
|
|
memset((uint8_t *)buf->b_data + osize, 0, size - osize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
dbuf_set_data(db, buf);
|
2020-02-27 03:09:17 +03:00
|
|
|
arc_buf_destroy(old_buf, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db.db_size = size;
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = list_head(&db->db_dirty_records);
|
2020-02-12 00:12:41 +03:00
|
|
|
/* dirty record added by dmu_buf_will_dirty() */
|
|
|
|
VERIFY(dr != NULL);
|
2020-02-05 22:07:19 +03:00
|
|
|
if (db->db_level == 0)
|
|
|
|
dr->dt.dl.dr_data = buf;
|
|
|
|
ASSERT3U(dr->dr_txg, ==, tx->tx_txg);
|
|
|
|
ASSERT3U(dr->dr_accounted, ==, osize);
|
|
|
|
dr->dr_accounted = size;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).
OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804
Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode().
Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts; these changes have not yet been
applied to OpenZFS.
2017-03-07 20:51:59 +03:00
|
|
|
dmu_objset_willuse_space(dn->dn_objset, size - osize, tx);
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
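The data movement in dbuf_new_size() above is: allocate a buffer of the new size, copy MIN(osize, size) bytes from the old buffer, and zero any remainder. A user-space sketch of just that step is shown below, assuming plain malloc() in place of arc_alloc_buf(). The function name sketch_resize_copy() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a buffer of `size` bytes holding the first MIN(osize, size)
 * bytes of `old`, with any tail beyond osize zero-filled.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
static void *
sketch_resize_copy(const void *old, size_t osize, size_t size)
{
	uint8_t *buf = malloc(size);

	if (buf == NULL)
		return (NULL);

	/* copy old block data to the new block */
	memcpy(buf, old, osize < size ? osize : size);

	/* zero the remainder when growing */
	if (size > osize)
		memset(buf + osize, 0, size - osize);

	return (buf);
}
```

Shrinking simply truncates, and growing never exposes uninitialized bytes, which matches the behavior of the memcpy()/memset() pair in the function above.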
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
|
|
|
dbuf_release_bp(dmu_buf_impl_t *db)
|
|
|
|
{
|
2019-12-05 23:37:00 +03:00
|
|
|
objset_t *os __maybe_unused = db->db_objset;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ASSERT(dsl_pool_sync_context(dmu_objset_pool(os)));
|
|
|
|
ASSERT(arc_released(os->os_phys_buf) ||
|
|
|
|
list_link_active(&os->os_dsl_dataset->ds_synced_link));
|
|
|
|
ASSERT(db->db_parent == NULL || arc_released(db->db_parent->db_buf));
|
|
|
|
|
2013-07-03 00:26:24 +04:00
|
|
|
(void) arc_release(db->db_buf, db);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2015-11-04 23:37:33 +03:00
|
|
|
/*
|
|
|
|
* We already have a dirty record for this TXG, and we are being
|
|
|
|
* dirtied again.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_redirty(dbuf_dirty_record_t *dr)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
|
|
|
if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID) {
|
|
|
|
/*
|
|
|
|
* If this buffer has already been written out,
|
|
|
|
* we now need to reset its state.
|
|
|
|
*/
|
|
|
|
dbuf_unoverride(dr);
|
|
|
|
if (db->db.db_object != DMU_META_DNODE_OBJECT &&
|
|
|
|
db->db_state != DB_NOFILL) {
|
|
|
|
/* Already released on initial dirty, so just thaw. */
|
|
|
|
ASSERT(arc_released(db->db_buf));
|
|
|
|
arc_buf_thaw(db->db_buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbufs and arc bufs, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
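The record layout described in the commit message above can be pictured with a small model. This is only a sketch under stated assumptions: `dirty_record_t`, `dr_kind_t`, `dirty_record_lightweight()` and `dirty_record_readable()` are hypothetical names made up for illustration, and the union members merely mirror the `dt.dl`, `dt.di` and `dt.dll` accessors that appear in the code of this file. It is not the real `dbuf_dirty_record_t` definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real OpenZFS definitions. */
typedef enum { DR_LEAF, DR_INDIRECT, DR_LIGHTWEIGHT } dr_kind_t;

typedef struct dirty_record {
	dr_kind_t	dr_kind;	/* illustrative tag, not in the source */
	uint64_t	dr_txg;		/* txg in which the block was dirtied */
	void		*dr_dbuf;	/* NULL when no dbuf was instantiated */
	union {
		struct { void *dr_data; } dl;		/* leaf: data held via a dbuf */
		struct { size_t dr_nchildren; } di;	/* indirect: child records */
		struct { uint64_t dr_blkid; } dll;	/* lightweight: block id only */
	} dt;
} dirty_record_t;

/* A lightweight record names its block by id because no dbuf exists. */
static dirty_record_t
dirty_record_lightweight(uint64_t blkid, uint64_t txg)
{
	dirty_record_t dr = { 0 };

	dr.dr_kind = DR_LIGHTWEIGHT;
	dr.dr_txg = txg;
	dr.dr_dbuf = NULL;
	dr.dt.dll.dr_blkid = blkid;
	return (dr);
}

/* In this model, dirty data can be read back only when a dbuf backs it. */
static int
dirty_record_readable(const dirty_record_t *dr)
{
	return (dr->dr_dbuf != NULL);
}
```

In this model a lightweight record is never readable, which corresponds to the commit message's remark that reads of "lightweight-dirty" blocks would see the on-disk state rather than the dirty data.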
2020-12-11 21:26:02 +03:00
|
|
|
dbuf_dirty_record_t *
|
|
|
|
dbuf_dirty_lightweight(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
rw_enter(&dn->dn_struct_rwlock, RW_READER);
|
|
|
|
IMPLY(dn->dn_objset->os_raw_receive, dn->dn_maxblkid >= blkid);
|
|
|
|
dnode_new_blkid(dn, blkid, tx, B_TRUE, B_FALSE);
|
|
|
|
ASSERT(dn->dn_maxblkid >= blkid);
|
|
|
|
|
|
|
|
dbuf_dirty_record_t *dr = kmem_zalloc(sizeof (*dr), KM_SLEEP);
|
|
|
|
list_link_init(&dr->dr_dirty_node);
|
|
|
|
list_link_init(&dr->dr_dbuf_node);
|
|
|
|
dr->dr_dnode = dn;
|
|
|
|
dr->dr_txg = tx->tx_txg;
|
|
|
|
dr->dt.dll.dr_blkid = blkid;
|
|
|
|
dr->dr_accounted = dn->dn_datablksz;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There should not be any dbuf for the block that we're dirtying.
|
|
|
|
* Otherwise the buffer contents could be inconsistent between the
|
|
|
|
* dbuf and the lightweight dirty record.
|
|
|
|
*/
|
2022-12-14 04:29:21 +03:00
|
|
|
ASSERT3P(NULL, ==, dbuf_find(dn->dn_objset, dn->dn_object, 0, blkid,
|
|
|
|
NULL));
|
2020-12-11 21:26:02 +03:00
|
|
|
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
int txgoff = tx->tx_txg & TXG_MASK;
|
|
|
|
if (dn->dn_free_ranges[txgoff] != NULL) {
|
|
|
|
range_tree_clear(dn->dn_free_ranges[txgoff], blkid, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dn->dn_nlevels == 1) {
|
|
|
|
ASSERT3U(blkid, <, dn->dn_nblkptr);
|
|
|
|
list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
|
|
|
dnode_setdirty(dn, tx);
|
|
|
|
} else {
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
|
|
|
|
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
|
|
|
|
dmu_buf_impl_t *parent_db = dbuf_hold_level(dn,
|
|
|
|
1, blkid >> epbs, FTAG);
|
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
|
|
|
if (parent_db == NULL) {
|
|
|
|
kmem_free(dr, sizeof (*dr));
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
int err = dbuf_read(parent_db, NULL,
|
|
|
|
(DB_RF_NOPREFETCH | DB_RF_CANFAIL));
|
|
|
|
if (err != 0) {
|
|
|
|
dbuf_rele(parent_db, FTAG);
|
|
|
|
kmem_free(dr, sizeof (*dr));
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
dbuf_dirty_record_t *parent_dr = dbuf_dirty(parent_db, tx);
|
|
|
|
dbuf_rele(parent_db, FTAG);
|
|
|
|
mutex_enter(&parent_dr->dt.di.dr_mtx);
|
|
|
|
ASSERT3U(parent_dr->dr_txg, ==, tx->tx_txg);
|
|
|
|
list_insert_tail(&parent_dr->dt.di.dr_children, dr);
|
|
|
|
mutex_exit(&parent_dr->dt.di.dr_mtx);
|
|
|
|
dr->dr_parent = parent_dr;
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_objset_willuse_space(dn->dn_objset, dr->dr_accounted, tx);
|
|
|
|
|
|
|
|
return (dr);
|
|
|
|
}
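Two pieces of index arithmetic in `dbuf_dirty_lightweight()` above may be easier to follow in isolation: `tx->tx_txg & TXG_MASK` selects the per-txg slot in arrays such as `dn_dirty_records[]`, and `blkid >> epbs` finds the level-1 indirect block that covers a leaf block. The sketch below is a standalone illustration. The constant values (`TXG_SIZE` of 4, `SPA_BLKPTRSHIFT` of 7) are assumptions that should be checked against the headers in your tree, and `txg_offset()` and `parent_blkid()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed to match the OpenZFS headers; verify against your tree. */
#define	TXG_SIZE		4
#define	TXG_MASK		(TXG_SIZE - 1)
#define	SPA_BLKPTRSHIFT		7	/* assumes a 128-byte block pointer */

/* Slot used for per-txg arrays such as dn_dirty_records[]. */
static int
txg_offset(uint64_t txg)
{
	return ((int)(txg & TXG_MASK));
}

/*
 * Block id of the level-1 indirect block covering a level-0 block.
 * epbs is log2 of the number of block pointers per indirect block.
 */
static uint64_t
parent_blkid(uint64_t blkid, int indblkshift)
{
	int epbs = indblkshift - SPA_BLKPTRSHIFT;

	return (blkid >> epbs);
}
```

With an indirect block shift of 17 (an assumed value), `epbs` is 10, so leaf blocks 0 through 1023 share parent 0 and leaf block 1024 starts parent 1.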
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_dirty_record_t *
|
|
|
|
dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
|
|
|
|
{
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_t *dn;
|
|
|
|
objset_t *os;
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr, *dr_next, *dr_head;
|
2008-11-20 23:01:55 +03:00
|
|
|
int txgoff = tx->tx_txg & TXG_MASK;
|
2019-07-08 23:18:50 +03:00
|
|
|
boolean_t drop_struct_rwlock = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(tx->tx_txg != 0);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
DMU_TX_DIRTY_BUF(tx, db);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Shouldn't dirty a regular buffer in syncing context. Private
|
|
|
|
* objects may be dirtied in syncing context, but only if they
|
|
|
|
* were already pre-dirtied in open context.
|
|
|
|
*/
|
2020-07-26 06:07:44 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2017-01-27 22:43:42 +03:00
|
|
|
if (dn->dn_objset->os_dsl_dataset != NULL) {
|
|
|
|
rrw_enter(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock,
|
|
|
|
RW_READER, FTAG);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!dmu_tx_is_syncing(tx) ||
|
|
|
|
BP_IS_HOLE(dn->dn_objset->os_rootbp) ||
|
2009-07-03 02:44:48 +04:00
|
|
|
DMU_OBJECT_IS_SPECIAL(dn->dn_object) ||
|
|
|
|
dn->dn_objset->os_dsl_dataset == NULL);
|
2017-01-27 22:43:42 +03:00
|
|
|
if (dn->dn_objset->os_dsl_dataset != NULL)
|
|
|
|
rrw_exit(&dn->dn_objset->os_dsl_dataset->ds_bp_rwlock, FTAG);
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We make this assert for private objects as well, but after we
|
|
|
|
* check if we're already dirty. They are allowed to re-dirty
|
|
|
|
* in syncing context.
|
|
|
|
*/
|
|
|
|
ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
|
|
|
|
dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
|
|
|
|
(dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
/*
|
|
|
|
* XXX make this true for indirects too? The problem is that
|
|
|
|
* transactions created with dmu_tx_create_assigned() from
|
|
|
|
* syncing context don't bother holding ahead.
|
|
|
|
*/
|
|
|
|
ASSERT(db->db_level != 0 ||
|
2008-12-03 23:09:06 +03:00
|
|
|
db->db_state == DB_CACHED || db->db_state == DB_FILL ||
|
|
|
|
db->db_state == DB_NOFILL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
2020-02-27 03:09:17 +03:00
|
|
|
dnode_set_dirtyctx(dn, tx, db);
|
2018-04-10 21:15:05 +03:00
|
|
|
if (tx->tx_txg > dn->dn_dirty_txg)
|
|
|
|
dn->dn_dirty_txg = tx->tx_txg;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_SPILL_BLKID)
|
|
|
|
dn->dn_have_spill = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If this buffer is already dirty, we're done.
|
|
|
|
*/
|
2020-02-05 22:07:19 +03:00
|
|
|
dr_head = list_head(&db->db_dirty_records);
|
|
|
|
ASSERT(dr_head == NULL || dr_head->dr_txg <= tx->tx_txg ||
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db.db_object == DMU_META_DNODE_OBJECT);
|
2020-02-05 22:07:19 +03:00
|
|
|
dr_next = dbuf_find_dirty_lte(db, tx->tx_txg);
|
|
|
|
if (dr_next && dr_next->dr_txg == tx->tx_txg) {
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_redirty(dr_next);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2020-02-05 22:07:19 +03:00
|
|
|
return (dr_next);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only valid if not already dirty.
|
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(dn->dn_object == 0 ||
|
|
|
|
dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
|
2008-11-20 23:01:55 +03:00
|
|
|
(dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
|
|
|
|
|
|
|
|
ASSERT3U(dn->dn_nlevels, >, db->db_level);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We should only be dirtying in syncing context if it's the
|
2009-07-03 02:44:48 +04:00
|
|
|
* mos or we're initializing the os or it's a special object.
|
|
|
|
* However, we are allowed to dirty in syncing context provided
|
|
|
|
* we already dirtied it in open context. Hence we must make
|
|
|
|
* this assertion only if we're not already dirty.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
os = dn->dn_objset;
|
2017-04-07 23:50:18 +03:00
|
|
|
VERIFY3U(tx->tx_txg, <=, spa_final_dirty_txg(os->os_spa));
|
2020-07-26 06:07:44 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2017-01-27 22:43:42 +03:00
|
|
|
if (dn->dn_objset->os_dsl_dataset != NULL)
|
|
|
|
rrw_enter(&os->os_dsl_dataset->ds_bp_rwlock, RW_READER, FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(!dmu_tx_is_syncing(tx) || DMU_OBJECT_IS_SPECIAL(dn->dn_object) ||
|
|
|
|
os->os_dsl_dataset == NULL || BP_IS_HOLE(os->os_rootbp));
|
2017-01-27 22:43:42 +03:00
|
|
|
if (dn->dn_objset->os_dsl_dataset != NULL)
|
|
|
|
rrw_exit(&os->os_dsl_dataset->ds_bp_rwlock, FTAG);
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db.db_size != 0);
|
|
|
|
|
|
|
|
dprintf_dbuf(db, "size=%llx\n", (u_longlong_t)db->db.db_size);
|
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
if (db->db_blkid != DMU_BONUS_BLKID && db->db_state != DB_NOFILL) {
|
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).
OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804
Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode().
Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts; these changes have not yet been
applied to OpenZFS.
2017-03-07 20:51:59 +03:00
|
|
|
dmu_objset_willuse_space(os, db->db.db_size, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this buffer is dirty in an old transaction group we need
|
|
|
|
* to make a copy of it so that the changes we make in this
|
|
|
|
* transaction group won't leak out when we sync the older txg.
|
|
|
|
*/
|
2014-11-21 03:09:39 +03:00
|
|
|
dr = kmem_zalloc(sizeof (dbuf_dirty_record_t), KM_SLEEP);
|
2010-08-26 21:26:44 +04:00
|
|
|
list_link_init(&dr->dr_dirty_node);
|
2020-02-05 22:07:19 +03:00
|
|
|
list_link_init(&dr->dr_dbuf_node);
|
2020-12-11 21:26:02 +03:00
|
|
|
dr->dr_dnode = dn;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_level == 0) {
|
|
|
|
void *data_old = db->db_buf;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (db->db_state != DB_NOFILL) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
2008-12-03 23:09:06 +03:00
|
|
|
dbuf_fix_old_data(db, tx->tx_txg);
|
|
|
|
data_old = db->db.db_data;
|
|
|
|
} else if (db->db.db_object != DMU_META_DNODE_OBJECT) {
|
|
|
|
/*
|
|
|
|
* Release the data buffer from the cache so
|
|
|
|
* that we can modify it without impacting
|
|
|
|
* possible other users of this cached data
|
|
|
|
* block. Note that indirect blocks and
|
|
|
|
* private objects are not released until the
|
|
|
|
* syncing state (since they are only modified
|
|
|
|
* then).
|
|
|
|
*/
|
|
|
|
arc_release(db->db_buf, db);
|
|
|
|
dbuf_fix_old_data(db, tx->tx_txg);
|
|
|
|
data_old = db->db_buf;
|
|
|
|
}
|
|
|
|
ASSERT(data_old != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
dr->dt.dl.dr_data = data_old;
|
|
|
|
} else {
|
Identify locks flagged by lockdep
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark
such mutexes when they are initialized. This mutex type causes
attempts to take or release those locks to be wrapped in lockdep_off()
and lockdep_on() calls to silence the dependency checker and allow the
use of lock_stats to examine contention.
For RW locks, it uses an analogous lock type, RW_NOLOCKDEP.
The goal is that these locks are ultimately changed back to type
MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect
their relationship (e.g. z_name_lock below) or any real problems with the
lock dependencies are fixed.
Some of the affected locks are:
tc_open_lock:
=============
This is an array of locks, all with same name, which txg_quiesce must
take all of in order to move txg to next state. All default to the same
lockdep class, and so to lockdep appears recursive.
zp->z_name_lock:
================
In zfs_rmdir,
dzp = znode for the directory (input to zfs_dirent_lock)
zp = znode for the entry being removed (output of zfs_dirent_lock)
zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp
zfs_rmdir() takes z_name_lock in zp
Since both dzp and zp are type znode_t, the locks have the same default
class, and lockdep considers it a possible recursive lock attempt.
l->l_rwlock:
============
zap_expand_leaf() sometimes creates two new zap leaf structures, via
these call paths:
zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open()
zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf()
Because both zap_leaf_open() and zap_create_leaf() initialize
l->l_rwlock in their (separate) leaf structures, the lockdep class is
the same, and the linux kernel believes these might both be the same
lock, and emits a possible recursive lock warning.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3895
2015-10-15 23:08:27 +03:00
|
|
|
mutex_init(&dr->dt.di.dr_mtx, NULL, MUTEX_NOLOCKDEP, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
list_create(&dr->dt.di.dr_children,
|
|
|
|
sizeof (dbuf_dirty_record_t),
|
|
|
|
offsetof(dbuf_dirty_record_t, dr_dirty_node));
|
|
|
|
}
|
2023-03-10 22:59:53 +03:00
|
|
|
if (db->db_blkid != DMU_BONUS_BLKID && db->db_state != DB_NOFILL) {
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
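As a rough illustration of item 2, a per-transaction delay that depends only on the amount of dirty data might look like the sketch below. The curve used here is an assumption chosen for illustration; the authoritative description is the block comment above dmu_tx_delay() that the commit message refers to. `throttle_delay_ns()` is a hypothetical helper, and its parameters loosely correspond to the tunables zfs_dirty_data_max, zfs_delay_min_dirty_percent and zfs_delay_scale named later in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: delay (in nanoseconds) applied to one transaction.
 * No delay is applied below the minimum-dirty threshold; above it the delay
 * grows as the amount of dirty data approaches dirty_max.
 */
static uint64_t
throttle_delay_ns(uint64_t dirty, uint64_t dirty_max, uint64_t min_dirty_pct,
    uint64_t delay_scale)
{
	uint64_t delay_min_bytes = dirty_max * min_dirty_pct / 100;

	if (dirty <= delay_min_bytes)
		return (0);
	if (dirty >= dirty_max)
		return (UINT64_MAX);	/* sketch only: caller must wait */
	return (delay_scale * (dirty - delay_min_bytes) / (dirty_max - dirty));
}
```

Because the result depends only on the current amount of dirty data, every transaction issued at the same load level sees the same small delay, which is the behaviour the message contrasts with the old "brick wall of wait".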
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
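The 32-bit overflow mentioned in the porting notes above can be reproduced in isolation. This is a sketch under stated assumptions: `ulong32_t` stands in for `unsigned long` on a 32-bit platform, and `store_tunable32()` and `dirty_max_max_from_ram()` are hypothetical helpers, not functions from the port.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for `unsigned long` on a 32-bit platform (assumption). */
typedef uint32_t ulong32_t;

/* Storing a 64-bit value in a 32-bit tunable keeps only the low 32 bits. */
static ulong32_t
store_tunable32(uint64_t value)
{
	return ((ulong32_t)value);
}

/* Derive the ceiling from physical RAM, as a percentage, instead of 2^32. */
static uint64_t
dirty_max_max_from_ram(uint64_t physmem_bytes, unsigned int pct)
{
	return (physmem_bytes / 100 * pct);
}
```

In this model, storing 2^32 yields 0, while 25% of 2 GiB of RAM still fits in 32 bits. Larger memory sizes can still overflow, which matches the note that the approach does not completely avoid the issue.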
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
dr->dr_accounted = db->db.db_size;
|
2023-03-10 22:59:53 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
dr->dr_dbuf = db;
|
|
|
|
dr->dr_txg = tx->tx_txg;
|
2020-02-05 22:07:19 +03:00
|
|
|
list_insert_before(&db->db_dirty_records, dr_next, dr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We could have been freed_in_flight between the dbuf_noread
|
|
|
|
* and dbuf_dirty. We win, as though the dbuf_noread() had
|
|
|
|
* happened after the free.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
|
|
|
|
db->db_blkid != DMU_SPILL_BLKID) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&dn->dn_mtx);
|
2014-04-16 07:40:22 +04:00
|
|
|
if (dn->dn_free_ranges[txgoff] != NULL) {
|
|
|
|
range_tree_clear(dn->dn_free_ranges[txgoff],
|
|
|
|
db->db_blkid, 1);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
db->db_freed_in_flight = FALSE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This buffer is now part of this txg
|
|
|
|
*/
|
|
|
|
dbuf_add_ref(db, (void *)(uintptr_t)tx->tx_txg);
|
|
|
|
db->db_dirtycnt += 1;
|
|
|
|
ASSERT3U(db->db_dirtycnt, <=, 3);
|
|
|
|
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID ||
|
|
|
|
db->db_blkid == DMU_SPILL_BLKID) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
ASSERT(!list_link_active(&dr->dr_dirty_node));
|
|
|
|
list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
dnode_setdirty(dn, tx);
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (dr);
|
2016-08-29 21:40:16 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
|
|
|
|
rw_enter(&dn->dn_struct_rwlock, RW_READER);
|
2019-07-08 23:18:50 +03:00
|
|
|
drop_struct_rwlock = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are overwriting a dedup BP, then unless it is snapshotted,
|
|
|
|
* when we get to syncing context we will need to decrement its
|
|
|
|
* refcount in the DDT. Prefetch the relevant DDT block so that
|
|
|
|
* syncing context won't have to wait for the i/o.
|
|
|
|
*/
|
|
|
|
if (db->db_blkptr != NULL) {
|
|
|
|
db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_READER, FTAG);
|
|
|
|
ddt_prefetch(os->os_spa, db->db_blkptr);
|
|
|
|
dmu_buf_unlock_parent(db, dblt, FTAG);
|
2016-08-29 21:40:16 +03:00
|
|
|
}
|
|
|
|
|
2017-03-21 01:38:11 +03:00
|
|
|
/*
|
|
|
|
* We need to hold the dn_struct_rwlock to make this assertion,
|
|
|
|
* because it protects dn_phys / dn_next_nlevels from changing.
|
|
|
|
*/
|
|
|
|
ASSERT((dn->dn_phys->dn_nlevels == 0 && db->db_level == 0) ||
|
|
|
|
dn->dn_phys->dn_nlevels > db->db_level ||
|
|
|
|
dn->dn_next_nlevels[txgoff] > db->db_level ||
|
|
|
|
dn->dn_next_nlevels[(tx->tx_txg-1) & TXG_MASK] > db->db_level ||
|
|
|
|
dn->dn_next_nlevels[(tx->tx_txg-2) & TXG_MASK] > db->db_level);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (db->db_level == 0) {
|
2018-06-28 19:20:34 +03:00
|
|
|
ASSERT(!db->db_objset->os_raw_receive ||
|
|
|
|
dn->dn_maxblkid >= db->db_blkid);
|
2019-03-13 20:52:01 +03:00
|
|
|
dnode_new_blkid(dn, db->db_blkid, tx,
|
2019-07-08 23:18:50 +03:00
|
|
|
drop_struct_rwlock, B_FALSE);
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(dn->dn_maxblkid >= db->db_blkid);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_level+1 < dn->dn_nlevels) {
|
|
|
|
dmu_buf_impl_t *parent = db->db_parent;
|
|
|
|
dbuf_dirty_record_t *di;
|
|
|
|
int parent_held = FALSE;
|
|
|
|
|
|
|
|
if (db->db_parent == NULL || db->db_parent == dn->dn_dbuf) {
|
|
|
|
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
|
2019-07-08 23:18:50 +03:00
|
|
|
parent = dbuf_hold_level(dn, db->db_level + 1,
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_blkid >> epbs, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(parent != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
parent_held = TRUE;
|
|
|
|
}
|
2019-07-08 23:18:50 +03:00
|
|
|
if (drop_struct_rwlock)
|
2008-11-20 23:01:55 +03:00
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
2019-07-08 23:18:50 +03:00
|
|
|
ASSERT3U(db->db_level + 1, ==, parent->db_level);
|
2008-11-20 23:01:55 +03:00
|
|
|
di = dbuf_dirty(parent, tx);
|
|
|
|
if (parent_held)
|
|
|
|
dbuf_rele(parent, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
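The 32-bit overflow concern raised in the porting notes above can be illustrated with a small standalone sketch. It models a 32-bit `unsigned long` with `uint32_t`; the helper names `dirty_max_max_illumos_32bit()` and `dirty_max_max_percent()` are hypothetical and are not part of ZFS, and the percentage computation is only an approximation of what the port does in arc_init().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of storing the Illumos default (2^32 bytes) in a 32-bit
 * unsigned long: only the low 32 bits survive, so the value wraps to 0.
 */
static uint32_t
dirty_max_max_illumos_32bit(void)
{
	uint64_t wanted = 1ULL << 32;	/* 2^32 bytes */

	return ((uint32_t)wanted);	/* truncated to 32 bits */
}

/*
 * Hypothetical helper: derive the limit as a percentage of physical RAM,
 * in the spirit of zfs_dirty_data_max_max_percent. The arithmetic is done
 * in 64 bits; whether the result fits a 32-bit variable depends on RAM size.
 */
static uint64_t
dirty_max_max_percent(uint64_t physmem_bytes, unsigned percent)
{
	assert(percent <= 100);
	return (physmem_bytes / 100 * percent);
}
```

With 4,000,000,000 bytes of RAM and 25 percent, the result is 1,000,000,000 bytes, which fits in 32 bits. With much larger RAM sizes the result can still exceed a 32-bit range, which matches the note that the percentage-based default does not completely avoid the overflow.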
|
|
|
/*
|
|
|
|
* Since we've dropped the mutex, it's possible that
|
|
|
|
* dbuf_undirty() might have changed this out from under us.
|
|
|
|
*/
|
2020-02-05 22:07:19 +03:00
|
|
|
if (list_head(&db->db_dirty_records) == dr ||
|
2008-11-20 23:01:55 +03:00
|
|
|
dn->dn_object == DMU_META_DNODE_OBJECT) {
|
|
|
|
mutex_enter(&di->dt.di.dr_mtx);
|
|
|
|
ASSERT3U(di->dr_txg, ==, tx->tx_txg);
|
|
|
|
ASSERT(!list_link_active(&dr->dr_dirty_node));
|
|
|
|
list_insert_tail(&di->dt.di.dr_children, dr);
|
|
|
|
mutex_exit(&di->dt.di.dr_mtx);
|
|
|
|
dr->dr_parent = di;
|
|
|
|
}
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
} else {
|
2019-07-08 23:18:50 +03:00
|
|
|
ASSERT(db->db_level + 1 == dn->dn_nlevels);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(db->db_blkid < dn->dn_nblkptr);
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT(db->db_parent == NULL || db->db_parent == dn->dn_dbuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
ASSERT(!list_link_active(&dr->dr_dirty_node));
|
|
|
|
list_insert_tail(&dn->dn_dirty_records[txgoff], dr);
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
2019-07-08 23:18:50 +03:00
|
|
|
if (drop_struct_rwlock)
|
2008-11-20 23:01:55 +03:00
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
|
|
|
}
|
|
|
|
|
|
|
|
dnode_setdirty(dn, tx);
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (dr);
|
|
|
|
}
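The assertion in dbuf_dirty() that checks `dn_next_nlevels[]` for the current and the two preceding transaction groups relies on indexing a small per-txg ring with `txg & TXG_MASK`. A minimal sketch of that indexing follows, assuming a ring of four slots (`TXG_SIZE` of 4 is an assumption here, and `txg_slot()` and `next_nlevels_covers()` are hypothetical helper names, not ZFS functions).

```c
#include <assert.h>
#include <stdint.h>

#define	TXG_SIZE	4		/* assumed ring size, must be a power of two */
#define	TXG_MASK	(TXG_SIZE - 1)

/* Map a transaction group number to its slot in the per-txg ring. */
static unsigned
txg_slot(uint64_t txg)
{
	return ((unsigned)(txg & TXG_MASK));
}

/*
 * Return nonzero if any of the slots for txg, txg-1 or txg-2 records a
 * level count greater than 'level', mirroring the shape of the assertion.
 */
static int
next_nlevels_covers(const uint8_t next_nlevels[TXG_SIZE], uint64_t txg,
    uint8_t level)
{
	assert(txg >= 2);
	return (next_nlevels[txg_slot(txg)] > level ||
	    next_nlevels[txg_slot(txg - 1)] > level ||
	    next_nlevels[txg_slot(txg - 2)] > level);
}
```

Because the mask only keeps the low bits, consecutive txgs map to consecutive slots and wrap around, so three adjacent txgs never share a slot in a four-entry ring.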
|
|
|
|
|
2020-02-08 01:22:29 +03:00
|
|
|
static void
|
|
|
|
dbuf_undirty_bonus(dbuf_dirty_record_t *dr)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
|
|
|
|
|
|
|
if (dr->dt.dl.dr_data != db->db.db_data) {
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
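The per-block CPU figure quoted in the commit message above can be checked with a short calculation. The sketch below uses only the numbers given in the message and assumes the block size is the same in both runs, so that CPU per MB/s is proportional to CPU per block; the function names are hypothetical.

```c
#include <assert.h>

/* CPU percentage consumed per MB/s of receive throughput. */
static double
cpu_per_mbps(double cpu_percent, double mbps)
{
	assert(mbps > 0.0);
	return (cpu_percent / mbps);
}

/* Ratio of new to baseline CPU cost per unit of data received. */
static double
per_block_cpu_ratio(void)
{
	/* Baseline: receive_writer() 40% + dbuf_evict() 35% at 450MB/s. */
	double base = cpu_per_mbps(40.0 + 35.0, 450.0);
	/* New: receive_writer() 17% + dbuf_evict() 0% at 670MB/s. */
	double now = cpu_per_mbps(17.0 + 0.0, 670.0);

	return (now / base);
}

/* Relative throughput gain of the new code over the baseline. */
static double
throughput_gain(void)
{
	return (670.0 / 450.0 - 1.0);
}
```

Under these assumptions the ratio comes out at roughly 0.15, in line with the "1/7th" figure, and the throughput gain is a little under 50%.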
|
|
|
struct dnode *dn = dr->dr_dnode;
|
2020-02-08 01:22:29 +03:00
|
|
|
int max_bonuslen = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots);
|
|
|
|
|
|
|
|
kmem_free(dr->dt.dl.dr_data, max_bonuslen);
|
|
|
|
arc_space_return(max_bonuslen, ARC_SPACE_BONUS);
|
|
|
|
}
|
|
|
|
db->db_data_pending = NULL;
|
|
|
|
ASSERT(list_next(&db->db_dirty_records, dr) == NULL);
|
|
|
|
list_remove(&db->db_dirty_records, dr);
|
|
|
|
if (dr->dr_dbuf->db_level != 0) {
|
|
|
|
mutex_destroy(&dr->dt.di.dr_mtx);
|
|
|
|
list_destroy(&dr->dt.di.dr_children);
|
|
|
|
}
|
|
|
|
kmem_free(dr, sizeof (dbuf_dirty_record_t));
|
|
|
|
ASSERT3U(db->db_dirtycnt, >, 0);
|
|
|
|
db->db_dirtycnt -= 1;
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Undirty a buffer in the transaction group referenced by the given
|
|
|
|
* transaction. Return whether this evicted the dbuf.
|
2013-09-04 16:00:57 +04:00
|
|
|
*/
|
2023-03-24 20:18:35 +03:00
|
|
|
boolean_t
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
uint64_t txg = tx->tx_txg;
|
2023-03-10 22:59:53 +03:00
|
|
|
boolean_t brtwrite;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(txg != 0);
|
2015-07-02 19:23:20 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Due to our use of dn_nlevels below, this can only be called
|
|
|
|
* in open context, unless we are operating on the MOS.
|
|
|
|
* From syncing context, dn_nlevels may be different from the
|
|
|
|
* dn_nlevels used when dbuf was dirtied.
|
|
|
|
*/
|
|
|
|
ASSERT(db->db_objset ==
|
|
|
|
dmu_objset_pool(db->db_objset)->dp_meta_objset ||
|
|
|
|
txg != spa_syncing_txg(dmu_objset_spa(db->db_objset)));
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT0(db->db_level);
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this buffer is not dirty, we're done.
|
|
|
|
*/
|
2020-12-11 21:26:02 +03:00
|
|
|
dbuf_dirty_record_t *dr = dbuf_find_dirty_eq(db, txg);
|
2020-02-05 22:07:19 +03:00
|
|
|
if (dr == NULL)
|
2013-09-04 16:00:57 +04:00
|
|
|
return (B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(dr->dr_dbuf == db);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
brtwrite = dr->dt.dl.dr_brtwrite;
|
|
|
|
if (brtwrite) {
|
|
|
|
/*
|
|
|
|
* We are freeing a block that we cloned in the same
|
|
|
|
* transaction group.
|
|
|
|
*/
|
|
|
|
brt_pending_remove(dmu_objset_spa(db->db_objset),
|
|
|
|
&dr->dt.dl.dr_overridden_by, tx);
|
|
|
|
}
|
|
|
|
|
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dprintf_dbuf(db, "size=%llx\n", (u_longlong_t)db->db.db_size);
|
|
|
|
|
|
|
|
ASSERT(db->db.db_size != 0);
|
|
|
|
|
2015-07-02 19:23:20 +03:00
|
|
|
dsl_pool_undirty_space(dmu_objset_pool(dn->dn_objset),
|
|
|
|
dr->dr_accounted, txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
list_remove(&db->db_dirty_records, dr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-07-26 22:37:06 +04:00
|
|
|
/*
|
|
|
|
* Note that there are three places in dbuf_dirty()
|
|
|
|
* where this dirty record may be put on a list.
|
|
|
|
* Make sure to do a list_remove corresponding to
|
|
|
|
* every one of those list_insert calls.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dr->dr_parent) {
|
|
|
|
mutex_enter(&dr->dr_parent->dt.di.dr_mtx);
|
|
|
|
list_remove(&dr->dr_parent->dt.di.dr_children, dr);
|
|
|
|
mutex_exit(&dr->dr_parent->dt.di.dr_mtx);
|
2011-07-26 22:37:06 +04:00
|
|
|
} else if (db->db_blkid == DMU_SPILL_BLKID ||
|
2015-07-02 19:23:20 +03:00
|
|
|
db->db_level + 1 == dn->dn_nlevels) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(db->db_blkptr == NULL || db->db_parent == dn->dn_dbuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
list_remove(&dn->dn_dirty_records[txg & TXG_MASK], dr);
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
}
|
|
|
|
|
2023-03-10 22:59:53 +03:00
|
|
|
if (db->db_state != DB_NOFILL && !brtwrite) {
|
2013-09-04 16:00:57 +04:00
|
|
|
dbuf_unoverride(dr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(db->db_buf != NULL);
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT(dr->dt.dl.dr_data != NULL);
|
|
|
|
if (dr->dt.dl.dr_data != db->db_buf)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(dr->dt.dl.dr_data, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-04-01 16:49:14 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(dr, sizeof (dbuf_dirty_record_t));
|
|
|
|
|
|
|
|
ASSERT(db->db_dirtycnt > 0);
|
|
|
|
db->db_dirtycnt -= 1;
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
if (zfs_refcount_remove(&db->db_holds, (void *)(uintptr_t)txg) == 0) {
|
2023-03-10 22:59:53 +03:00
|
|
|
ASSERT(db->db_state == DB_NOFILL || brtwrite ||
|
|
|
|
arc_released(db->db_buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_destroy(db);
|
2013-09-04 16:00:57 +04:00
|
|
|
return (B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
return (B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
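The comment in dbuf_undirty() requires a list_remove for every list_insert done in dbuf_dirty(). The selection logic can be sketched as a toy classifier that mirrors the `if (dr->dr_parent) ... else if (spill or top level)` chain. The types and names here (`toy_dr`, `toy_dr_home()`, the `DR_HOME_*` values) are hypothetical simplifications, not ZFS structures.

```c
#include <assert.h>
#include <stddef.h>

/* Which list, if any, a dirty record has to be removed from. */
enum dr_home {
	DR_HOME_NONE,			/* linked on neither list */
	DR_HOME_PARENT_CHILDREN,	/* parent record's dr_children */
	DR_HOME_DNODE_DIRTY		/* dnode's dn_dirty_records[txg slot] */
};

/* Toy stand-in for the fields the decision depends on. */
struct toy_dr {
	int has_parent;		/* dr_parent != NULL */
	int is_spill;		/* db_blkid == DMU_SPILL_BLKID */
	int level;		/* db_level */
};

static enum dr_home
toy_dr_home(const struct toy_dr *dr, int dn_nlevels)
{
	assert(dr != NULL);
	if (dr->has_parent)
		return (DR_HOME_PARENT_CHILDREN);
	if (dr->is_spill || dr->level + 1 == dn_nlevels)
		return (DR_HOME_DNODE_DIRTY);
	return (DR_HOME_NONE);
}
```

Keeping the removal decision in the same shape as the insertion decision is what prevents a freed record from remaining on `dn_dirty_records[]`, the failure described in the Illumos #764 commit message.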
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
static void
|
|
|
|
dmu_buf_will_dirty_impl(dmu_buf_t *db_fake, int flags, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-12-09 22:37:51 +04:00
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
2023-04-30 12:47:09 +03:00
|
|
|
boolean_t undirty = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(tx->tx_txg != 0);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-11-04 23:37:33 +03:00
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Quick check for dirtiness. For already dirty blocks, this
|
2015-11-04 23:37:33 +03:00
|
|
|
* reduces runtime of this function by >90%, and overall performance
|
|
|
|
* by 50% for some workloads (e.g. file deletion with indirect blocks
|
|
|
|
* cached).
|
|
|
|
*/
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
|
2023-04-30 12:47:09 +03:00
|
|
|
if (db->db_state == DB_CACHED || db->db_state == DB_NOFILL) {
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr = dbuf_find_dirty_eq(db, tx->tx_txg);
|
2015-11-04 23:37:33 +03:00
|
|
|
/*
|
|
|
|
* It's possible that it is already dirty but not cached,
|
|
|
|
* because there are some calls to dbuf_dirty() that don't
|
|
|
|
* go through dmu_buf_will_dirty().
|
|
|
|
*/
|
2020-02-05 22:07:19 +03:00
|
|
|
if (dr != NULL) {
|
2023-04-30 12:47:09 +03:00
|
|
|
if (dr->dt.dl.dr_brtwrite) {
|
|
|
|
/*
|
|
|
|
* Block cloning: If we are dirtying a cloned
|
|
|
|
* block, we cannot simply redirty it, because
|
|
|
|
* this dr has no data associated with it.
|
|
|
|
* We will go through a full undirtying below,
|
|
|
|
* before dirtying it again.
|
|
|
|
*/
|
|
|
|
undirty = B_TRUE;
|
|
|
|
} else {
|
|
|
|
/* This dbuf is already dirty and cached. */
|
|
|
|
dbuf_redirty(dr);
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
return;
|
|
|
|
}
|
2015-11-04 23:37:33 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
if (RW_WRITE_HELD(&DB_DNODE(db)->dn_struct_rwlock))
|
2017-08-14 20:36:48 +03:00
|
|
|
flags |= DB_RF_HAVESTRUCT;
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2023-04-30 12:47:09 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Block cloning: Do the dbuf_read() before undirtying the dbuf, as we
|
|
|
|
* want to make sure dbuf_read() will read the pending cloned block and
|
|
|
|
* not the underlying block that is being replaced. dbuf_undirty() will
|
|
|
|
* do dbuf_unoverride(), so we will end up with cloned block content,
|
|
|
|
* without overridden BP.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
(void) dbuf_read(db, NULL, flags);
|
2023-04-30 12:47:09 +03:00
|
|
|
	if (undirty) {
		mutex_enter(&db->db_mtx);
		VERIFY(!dbuf_undirty(db, tx));
		mutex_exit(&db->db_mtx);
	}
|
2008-11-20 23:01:55 +03:00
|
|
|
	(void) dbuf_dirty(db, tx);
}
|
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
void
dmu_buf_will_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_will_dirty_impl(db_fake,
	    DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH, tx);
}
|
|
|
|
|
2019-03-06 20:50:55 +03:00
|
|
|
boolean_t
dmu_buf_is_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2019-03-06 20:50:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = dbuf_find_dirty_eq(db, tx->tx_txg);
|
2019-03-06 20:50:55 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2020-02-05 22:07:19 +03:00
|
|
|
return (dr != NULL);
|
2019-03-06 20:50:55 +03:00
|
|
|
}
|
|
|
|
|
2023-04-30 12:47:09 +03:00
|
|
|
void
dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	/*
	 * Block cloning: We are going to clone into this block, so undirty
	 * modifications done to this block so far in this txg. This includes
	 * writes and clones into this block.
	 */
	mutex_enter(&db->db_mtx);
|
dmu_buf_will_clone: fix race in transition back to NOFILL
Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.
This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() called (which clears out the old data), but at this point
it's too late, because the lock was already dropped with that invalid
state.
Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.
Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.
This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.
Sponsored-by: Klara, Inc.
Sponsored-By: OpenDrives Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15566
Closes #15526
2023-11-28 20:53:04 +03:00
|
|
|
DBUF_VERIFY(db);
|
2023-04-30 12:47:09 +03:00
|
|
|
VERIFY(!dbuf_undirty(db, tx));
|
2023-09-01 04:17:12 +03:00
|
|
|
ASSERT3P(dbuf_find_dirty_eq(db, tx->tx_txg), ==, NULL);
|
2023-04-30 12:47:09 +03:00
|
|
|
	if (db->db_buf != NULL) {
		arc_buf_destroy(db->db_buf, db);
		db->db_buf = NULL;
|
2023-11-28 20:53:04 +03:00
|
|
|
dbuf_clear_data(db);
|
2023-04-30 12:47:09 +03:00
|
|
|
}
|
2023-11-28 20:53:04 +03:00
|
|
|
|
|
|
|
	db->db_state = DB_NOFILL;
	DTRACE_SET_STATE(db, "allocating NOFILL buffer for clone");

	DBUF_VERIFY(db);
|
2023-04-30 12:47:09 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
2023-11-28 20:53:04 +03:00
|
|
|
	dbuf_noread(db);
	(void) dbuf_dirty(db, tx);
|
2023-04-30 12:47:09 +03:00
|
|
|
}
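The locking discipline that dmu_buf_will_clone() follows above (roll back the dirty state, drop the stale data, and set the new state, all before the mutex is released) can be illustrated with a minimal user-space sketch. Everything in it is hypothetical: the toy_* types and functions are simplified stand-ins, a pthread mutex plays the role of db_mtx, and none of it is the real dbuf API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for dbuf states; not the real enum. */
typedef enum { TOY_UNCACHED, TOY_CACHED, TOY_NOFILL } toy_state_t;

typedef struct toy_dbuf {
	pthread_mutex_t lock;	/* plays the role of db_mtx */
	toy_state_t state;	/* plays the role of db_state */
	void *data;		/* plays the role of the cached buffer */
	int dirty;		/* 1 if a dirty record exists for this txg */
} toy_dbuf_t;

static void
toy_dbuf_init(toy_dbuf_t *db)
{
	pthread_mutex_init(&db->lock, NULL);
	db->state = TOY_UNCACHED;
	db->data = NULL;
	db->dirty = 0;
}

/* Simulate a write: attach fresh data and mark the buffer dirty. */
static int
toy_write(toy_dbuf_t *db, size_t len)
{
	void *buf = calloc(1, len);

	if (buf == NULL)
		return (-1);
	pthread_mutex_lock(&db->lock);
	free(db->data);
	db->data = buf;
	db->state = TOY_CACHED;
	db->dirty = 1;
	pthread_mutex_unlock(&db->lock);
	return (0);
}

/*
 * Prepare for a clone: undo the dirty state, drop the stale data and
 * switch to NOFILL in one critical section, so no other lock holder
 * can observe a half-updated buffer.
 */
static void
toy_will_clone(toy_dbuf_t *db)
{
	pthread_mutex_lock(&db->lock);
	db->dirty = 0;
	free(db->data);
	db->data = NULL;
	db->state = TOY_NOFILL;
	/* Invariant check before the lock is released. */
	assert(db->data == NULL && db->dirty == 0);
	pthread_mutex_unlock(&db->lock);
}

/* A reader that takes the lock and asks whether cached data is visible. */
static int
toy_has_readable_data(toy_dbuf_t *db)
{
	int readable;

	pthread_mutex_lock(&db->lock);
	readable = (db->state == TOY_CACHED && db->data != NULL);
	pthread_mutex_unlock(&db->lock);
	return (readable);
}
```

In this sketch a reader that acquires the lock after toy_will_clone() returns can never see the earlier write, because the data pointer and the state change together inside one critical section.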
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
void
dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
2023-05-19 23:05:53 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
2008-12-03 23:09:06 +03:00
|
|
|
db->db_state = DB_NOFILL;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "allocating NOFILL buffer");
|
2023-05-19 23:05:53 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2023-04-30 12:47:09 +03:00
|
|
|
|
|
|
|
dbuf_noread(db);
|
|
|
|
(void) dbuf_dirty(db, tx);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
dmu: Allow buffer fills to fail
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in the specific case where somebody writes a file's own mmap()'ed
content back to the same location. In that situation the only copy
of the data is corrupted on the page fault, and the following
retries only make the corruption permanent. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). This is required
so that, in case of error, the buffer can be restored to its state
after the cloning, not before it, as would happen if we just called
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665
2023-12-15 20:51:41 +03:00
|
|
|
dmu_buf_will_fill(dmu_buf_t *db_fake, dmu_tx_t *tx, boolean_t canfail)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2008-11-20 23:01:55 +03:00
|
|
|
	ASSERT(tx->tx_txg != 0);
	ASSERT(db->db_level == 0);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
	ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT ||
	    dmu_tx_private_ok(tx));
|
|
|
|
|
2023-05-19 23:05:53 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
2023-04-30 12:47:09 +03:00
|
|
|
	if (db->db_state == DB_NOFILL) {
		/*
		 * Block cloning: We will be completely overwriting a block
		 * cloned in this transaction group, so let's undirty the
		 * pending clone and mark the block as uncached. This will be
|
2023-12-15 20:51:41 +03:00
|
|
|
		 * as if the clone was never done. But if the fill can fail
		 * we should have a way to return back to the cloned data.
|
2023-04-30 12:47:09 +03:00
|
|
|
*/
|
2023-12-15 20:51:41 +03:00
|
|
|
		if (canfail && dbuf_find_dirty_eq(db, tx->tx_txg) != NULL) {
			mutex_exit(&db->db_mtx);
			dmu_buf_will_dirty(db_fake, tx);
			return;
		}
|
2023-04-30 12:47:09 +03:00
|
|
|
		VERIFY(!dbuf_undirty(db, tx));
		db->db_state = DB_UNCACHED;
	}
|
2023-05-19 23:05:53 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
2023-04-30 12:47:09 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
	dbuf_noread(db);
	(void) dbuf_dirty(db, tx);
}
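The commit message for "dmu: Allow buffer fills to fail" describes passing the fill status to the completion step so that a failed fill can discard the buffer as if no write happened. A minimal sketch of that idea follows. It is only an illustration under stated assumptions: the toy_* names are hypothetical, a NULL source pointer stands in for a page fault, and the real dmu_buf_fill_done() works on dbufs and dirty records rather than on a plain heap buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a buffer that is overwritten as a whole. */
typedef struct toy_buf {
	char *data;	/* NULL means "not cached" */
	size_t len;
	int dirty;	/* 1 while a pending write is attached */
} toy_buf_t;

/* Begin a whole-block overwrite: allocate without reading old content. */
static int
toy_will_fill(toy_buf_t *b, size_t len)
{
	free(b->data);
	b->data = malloc(len);
	if (b->data == NULL) {
		b->len = 0;
		b->dirty = 0;
		return (-1);
	}
	b->len = len;
	b->dirty = 1;
	return (0);
}

/* Copy the new content; a NULL source models a fault during the copy. */
static int
toy_fill(toy_buf_t *b, const char *src)
{
	if (src == NULL)
		return (-1);
	memcpy(b->data, src, b->len);
	return (0);
}

/*
 * Finish the fill. On failure the half-written buffer is destroyed, so
 * the caller ends up in the same state as if no write had been started.
 */
static void
toy_fill_done(toy_buf_t *b, int failed)
{
	if (failed) {
		free(b->data);
		b->data = NULL;
		b->len = 0;
		b->dirty = 0;
	}
}
```

A caller would pair the three steps: toy_will_fill(), then toy_fill(), then toy_fill_done() with the fill result, and retry the whole sequence after a failure.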
|
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
/*
 * This function is effectively the same as dmu_buf_will_dirty(), but
|
2018-04-17 21:06:54 +03:00
|
|
|
 * indicates the caller expects raw encrypted data in the db, and provides
 * the crypt params (byteorder, salt, iv, mac) which should be stored in the
 * blkptr_t when this dbuf is written. This is only used for blocks of
 * dnodes, during raw receive.
|
2017-08-14 20:36:48 +03:00
|
|
|
*/
|
|
|
|
void
|
2018-04-17 21:06:54 +03:00
|
|
|
dmu_buf_set_crypt_params(dmu_buf_t *db_fake, boolean_t byteorder,
    const uint8_t *salt, const uint8_t *iv, const uint8_t *mac, dmu_tx_t *tx)
|
2017-08-14 20:36:48 +03:00
|
|
|
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
	dbuf_dirty_record_t *dr;
|
|
|
|
|
2018-04-17 21:06:54 +03:00
|
|
|
	/*
	 * dr_has_raw_params is only processed for blocks of dnodes
	 * (see dbuf_sync_dnode_leaf_crypt()).
	 */
	ASSERT3U(db->db.db_object, ==, DMU_META_DNODE_OBJECT);
	ASSERT3U(db->db_level, ==, 0);
	ASSERT(db->db_objset->os_raw_receive);
|
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
	dmu_buf_will_dirty_impl(db_fake,
	    DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH | DB_RF_NO_DECRYPT, tx);
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = dbuf_find_dirty_eq(db, tx->tx_txg);
|
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
ASSERT3P(dr, !=, NULL);
|
2018-04-17 21:06:54 +03:00
|
|
|
|
|
|
|
	dr->dt.dl.dr_has_raw_params = B_TRUE;
	dr->dt.dl.dr_byteorder = byteorder;
|
2022-02-25 16:26:54 +03:00
|
|
|
	memcpy(dr->dt.dl.dr_salt, salt, ZIO_DATA_SALT_LEN);
	memcpy(dr->dt.dl.dr_iv, iv, ZIO_DATA_IV_LEN);
	memcpy(dr->dt.dl.dr_mac, mac, ZIO_DATA_MAC_LEN);
|
2017-08-14 20:36:48 +03:00
|
|
|
}
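The tail of dmu_buf_set_crypt_params() above records the raw crypt parameters on the dirty record by copying three fixed-length byte arrays. A small stand-alone sketch of that pattern is shown here. The toy_* names are hypothetical and the TOY_*_LEN values are chosen for illustration only; they are an assumption and are not taken from this listing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative lengths only; assumed for this sketch. */
#define	TOY_SALT_LEN	8
#define	TOY_IV_LEN	12
#define	TOY_MAC_LEN	16

/* Hypothetical stand-in for the leaf part of a dirty record. */
typedef struct toy_dirty_record {
	int has_raw_params;
	int byteorder;
	uint8_t salt[TOY_SALT_LEN];
	uint8_t iv[TOY_IV_LEN];
	uint8_t mac[TOY_MAC_LEN];
} toy_dirty_record_t;

/*
 * Store the caller-supplied parameters on the record. Each source array
 * must be at least as long as the matching TOY_*_LEN constant.
 */
static void
toy_set_crypt_params(toy_dirty_record_t *dr, int byteorder,
    const uint8_t *salt, const uint8_t *iv, const uint8_t *mac)
{
	assert(dr != NULL);

	dr->has_raw_params = 1;
	dr->byteorder = byteorder;
	memcpy(dr->salt, salt, TOY_SALT_LEN);
	memcpy(dr->iv, iv, TOY_IV_LEN);
	memcpy(dr->mac, mac, TOY_MAC_LEN);
}
```

Copying into arrays embedded in the record, rather than keeping the caller's pointers, means the record stays valid after the caller's buffers go out of scope.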
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
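As a rough illustration of the filtering step described above (not the actual OpenZFS implementation), the sketch below walks a snapshot's block IDs and leaves out any that appear in a sorted redaction list. The names `redact_list_contains` and `send_filtered`, and the flat sorted-array representation of the redaction bookmark, are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: binary search over a sorted array of redacted
 * block IDs. Returns 1 if blkid is present, 0 otherwise.
 */
static int
redact_list_contains(const uint64_t *list, size_t n, uint64_t blkid)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (list[mid] == blkid)
			return (1);
		if (list[mid] < blkid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}

/*
 * Hypothetical helper: copy every block ID in [0, nblocks) that is not
 * redacted into out[] and return how many IDs were "sent".
 * out[] must have room for nblocks entries.
 */
static size_t
send_filtered(uint64_t nblocks, const uint64_t *redacted, size_t nredacted,
    uint64_t *out)
{
	size_t sent = 0;

	assert(out != NULL);
	for (uint64_t blkid = 0; blkid < nblocks; blkid++) {
		if (redact_list_contains(redacted, nredacted, blkid))
			continue;	/* redacted: leave out of the stream */
		out[sent++] = blkid;
	}
	return (sent);
}
```

With six blocks and a redaction list of `{1, 4}`, `send_filtered` would place blocks 0, 2, 3 and 5 in `out[]`. The real send code works on block pointers and redaction ranges rather than a flat array of IDs.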
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
static void
|
|
|
|
dbuf_override_impl(dmu_buf_impl_t *db, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
struct dirty_leaf *dl;
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2019-06-19 19:48:13 +03:00
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = list_head(&db->db_dirty_records);
|
2022-10-12 21:25:18 +03:00
|
|
|
ASSERT3P(dr, !=, NULL);
|
2020-02-05 22:07:19 +03:00
|
|
|
ASSERT3U(dr->dr_txg, ==, tx->tx_txg);
|
|
|
|
dl = &dr->dt.dl;
|
2019-06-19 19:48:13 +03:00
|
|
|
dl->dr_overridden_by = *bp;
|
|
|
|
dl->dr_override_state = DR_OVERRIDDEN;
|
2020-02-05 22:07:19 +03:00
|
|
|
dl->dr_overridden_by.blk_birth = dr->dr_txg;
|
2019-06-19 19:48:13 +03:00
|
|
|
}
|
|
|
|
|
dmu: Allow buffer fills to fail
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. This is a good optimization, but if the buffer
fill fails due to a page fault or something else, the buffer ends up
corrupted, neither keeping the old content nor getting the new one.
On FreeBSD this is further complicated by page faults being
blocked by the VFS layer, which always returns EFAULT on an attempt
to write from an mmap()'ed but not yet cached address range. Normally
this is not a big problem, since after the original failure VFS will
retry the write after reading the required data. The problem becomes
worse in the specific case when somebody tries to write into a file
its own mmap()'ed content from the same location. In that situation
the only copy of the data gets corrupted on the page fault, and the
following retries only cement the status quo. Block cloning makes
this issue easier to reproduce, since it does not read the old data,
unlike a traditional file copy, which may work by chance.
This patch passes the fill status to dmu_buf_fill_done(), which
in case of error can destroy the corrupted buffer as if no write
had happened. One more complication with block cloning is that
if an error is possible during the fill, dmu_buf_will_fill() must
read the data by falling back to dmu_buf_will_dirty(). This is
required so that on error the buffer can be restored to its state
after the cloning, not before it, which is what would happen if we
just called dbuf_undirty().
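The fill-status handling described above can be sketched as a toy state machine. This is a simplified model under stated assumptions, not the OpenZFS code: `toy_buf`, `toy_will_fill` and `toy_fill_done` are hypothetical names, and the three states only loosely mirror DB_UNCACHED, DB_FILL and DB_CACHED.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for dbuf states; not the real OpenZFS definitions. */
enum toy_state { TOY_UNCACHED, TOY_FILL, TOY_CACHED };

struct toy_buf {
	enum toy_state state;
	int dirty;		/* nonzero while a dirty record exists */
	char data[16];
};

/* Begin a whole-block overwrite without reading the old content. */
static void
toy_will_fill(struct toy_buf *b)
{
	assert(b != NULL);
	b->state = TOY_FILL;
	b->dirty = 1;
}

/*
 * Finish a fill. If the fill failed, discard the half-written data and
 * the dirty record so the buffer looks as if no write happened.
 * Returns nonzero only when a failure was acted on.
 */
static int
toy_fill_done(struct toy_buf *b, int failed)
{
	assert(b != NULL);
	if (b->state == TOY_FILL && failed) {
		memset(b->data, 0, sizeof (b->data));
		b->dirty = 0;
		b->state = TOY_UNCACHED;
		return (1);
	}
	/* Successful fill, or the buffer was not being filled at all. */
	b->state = TOY_CACHED;
	return (0);
}
```

A caller in this model would start with `toy_will_fill()`, copy data in, and pass the copy's outcome to `toy_fill_done()`. A nonzero return tells the caller the write was rolled back and may be retried.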
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665
2023-12-15 20:51:41 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_buf_fill_done(dmu_buf_t *dbuf, dmu_tx_t *tx, boolean_t failed)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) tx;
|
2019-06-19 19:48:13 +03:00
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbuf;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
DBUF_VERIFY(db);
|
|
|
|
|
2023-12-15 20:51:41 +03:00
|
|
|
if (db->db_state == DB_FILL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_level == 0 && db->db_freed_in_flight) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
2008-11-20 23:01:55 +03:00
|
|
|
/* we were freed while filling */
|
|
|
|
/* XXX dbuf_undirty? */
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db->db.db_data, 0, db->db.db_size);
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_freed_in_flight = FALSE;
|
2023-12-15 20:51:41 +03:00
|
|
|
db->db_state = DB_CACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db,
|
|
|
|
"fill done handling freed in flight");
|
2023-12-15 20:51:41 +03:00
|
|
|
failed = B_FALSE;
|
|
|
|
} else if (failed) {
|
|
|
|
VERIFY(!dbuf_undirty(db, tx));
|
|
|
|
db->db_buf = NULL;
|
|
|
|
dbuf_clear_data(db);
|
|
|
|
DTRACE_SET_STATE(db, "fill failed");
|
2020-02-18 22:21:37 +03:00
|
|
|
} else {
|
2023-12-15 20:51:41 +03:00
|
|
|
db->db_state = DB_CACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "fill done");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
cv_broadcast(&db->db_changed);
|
2023-12-15 20:51:41 +03:00
|
|
|
} else {
|
|
|
|
db->db_state = DB_CACHED;
|
|
|
|
failed = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(&db->db_mtx);
|
2023-12-15 20:51:41 +03:00
|
|
|
return (failed);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
void
|
|
|
|
dmu_buf_write_embedded(dmu_buf_t *dbuf, void *data,
|
|
|
|
bp_embedded_type_t etype, enum zio_compress comp,
|
|
|
|
int uncompressed_size, int compressed_size, int byteorder,
|
|
|
|
dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbuf;
|
|
|
|
struct dirty_leaf *dl;
|
|
|
|
dmu_object_type_t type;
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2014-06-06 01:19:08 +04:00
|
|
|
|
2015-07-24 19:53:55 +03:00
|
|
|
if (etype == BP_EMBEDDED_TYPE_DATA) {
|
|
|
|
ASSERT(spa_feature_is_active(dmu_objset_spa(db->db_objset),
|
|
|
|
SPA_FEATURE_EMBEDDED_DATA));
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
type = DB_DNODE(db)->dn_type;
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
|
|
|
ASSERT0(db->db_level);
|
|
|
|
ASSERT(db->db_blkid != DMU_BONUS_BLKID);
|
|
|
|
|
|
|
|
dmu_buf_will_not_fill(dbuf, tx);
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = list_head(&db->db_dirty_records);
|
2022-10-12 21:25:18 +03:00
|
|
|
ASSERT3P(dr, !=, NULL);
|
2020-02-05 22:07:19 +03:00
|
|
|
ASSERT3U(dr->dr_txg, ==, tx->tx_txg);
|
|
|
|
dl = &dr->dt.dl;
|
2014-06-06 01:19:08 +04:00
|
|
|
encode_embedded_bp_compressed(&dl->dr_overridden_by,
|
|
|
|
data, comp, uncompressed_size, compressed_size);
|
|
|
|
BPE_SET_ETYPE(&dl->dr_overridden_by, etype);
|
|
|
|
BP_SET_TYPE(&dl->dr_overridden_by, type);
|
|
|
|
BP_SET_LEVEL(&dl->dr_overridden_by, 0);
|
|
|
|
BP_SET_BYTEORDER(&dl->dr_overridden_by, byteorder);
|
|
|
|
|
|
|
|
dl->dr_override_state = DR_OVERRIDDEN;
|
2020-02-05 22:07:19 +03:00
|
|
|
dl->dr_overridden_by.blk_birth = dr->dr_txg;
|
2014-06-06 01:19:08 +04:00
|
|
|
}
|
|
|
|
|
2019-06-19 19:48:13 +03:00
|
|
|
void
|
|
|
|
dmu_buf_redact(dmu_buf_t *dbuf, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbuf;
|
|
|
|
dmu_object_type_t type;
|
|
|
|
ASSERT(dsl_dataset_feature_is_active(db->db_objset->os_dsl_dataset,
|
|
|
|
SPA_FEATURE_REDACTED_DATASETS));
|
|
|
|
|
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
type = DB_DNODE(db)->dn_type;
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
|
|
|
|
ASSERT0(db->db_level);
|
|
|
|
dmu_buf_will_not_fill(dbuf, tx);
|
|
|
|
|
|
|
|
blkptr_t bp = { { { {0} } } };
|
|
|
|
BP_SET_TYPE(&bp, type);
|
|
|
|
BP_SET_LEVEL(&bp, 0);
|
|
|
|
BP_SET_BIRTH(&bp, tx->tx_txg, 0);
|
|
|
|
BP_SET_REDACTED(&bp);
|
|
|
|
BPE_SET_LSIZE(&bp, dbuf->db_size);
|
|
|
|
|
|
|
|
dbuf_override_impl(db, &bp, tx);
|
|
|
|
}
|
|
|
|
|
/*
 * Directly assign a provided arc buf to a given dbuf if it's not referenced
 * by anybody except our caller. Otherwise copy arcbuf's contents to dbuf.
 */
void
dbuf_assign_arcbuf(dmu_buf_impl_t *db, arc_buf_t *buf, dmu_tx_t *tx)
{
	ASSERT(!zfs_refcount_is_zero(&db->db_holds));
	ASSERT(db->db_blkid != DMU_BONUS_BLKID);
	ASSERT(db->db_level == 0);
	ASSERT3U(dbuf_is_metadata(db), ==, arc_is_metadata(buf));
	ASSERT(buf != NULL);
	ASSERT3U(arc_buf_lsize(buf), ==, db->db.db_size);
	ASSERT(tx->tx_txg != 0);

	arc_return_buf(buf, db);
	ASSERT(arc_released(buf));

	mutex_enter(&db->db_mtx);

	while (db->db_state == DB_READ || db->db_state == DB_FILL)
		cv_wait(&db->db_changed, &db->db_mtx);

	ASSERT(db->db_state == DB_CACHED || db->db_state == DB_UNCACHED ||
	    db->db_state == DB_NOFILL);

	if (db->db_state == DB_CACHED &&
	    zfs_refcount_count(&db->db_holds) - 1 > db->db_dirtycnt) {
		/*
		 * In practice, we will never have a case where we have an
		 * encrypted arc buffer while additional holds exist on the
		 * dbuf. We don't handle this here so we simply assert that
		 * fact instead.
		 */
		ASSERT(!arc_is_encrypted(buf));
		mutex_exit(&db->db_mtx);
		(void) dbuf_dirty(db, tx);
		memcpy(db->db.db_data, buf->b_data, db->db.db_size);
		arc_buf_destroy(buf, db);
		return;
	}

	if (db->db_state == DB_CACHED) {
		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);

		ASSERT(db->db_buf != NULL);
		if (dr != NULL && dr->dr_txg == tx->tx_txg) {
			ASSERT(dr->dt.dl.dr_data == db->db_buf);

			if (!arc_released(db->db_buf)) {
				ASSERT(dr->dt.dl.dr_override_state ==
				    DR_OVERRIDDEN);
				arc_release(db->db_buf, db);
			}
			dr->dt.dl.dr_data = buf;
			arc_buf_destroy(db->db_buf, db);
		} else if (dr == NULL || dr->dt.dl.dr_data != db->db_buf) {
			arc_release(db->db_buf, db);
			arc_buf_destroy(db->db_buf, db);
		}
		db->db_buf = NULL;
	} else if (db->db_state == DB_NOFILL) {
		/*
		 * We will be completely replacing the cloned block. In case
		 * it was cloned in this transaction group, let's undirty the
		 * pending clone and mark the block as uncached. This will be
		 * as if the clone was never done.
		 */
		VERIFY(!dbuf_undirty(db, tx));
		db->db_state = DB_UNCACHED;
	}
	ASSERT(db->db_buf == NULL);
	dbuf_set_data(db, buf);
	db->db_state = DB_FILL;
	DTRACE_SET_STATE(db, "filling assigned arcbuf");
	mutex_exit(&db->db_mtx);
	(void) dbuf_dirty(db, tx);
dmu: Allow buffer fills to fail
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.
On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range. Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data. The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location. In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo. Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.
This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened. One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty(). It is required
to allow in case of error restoring the buffer to a state after
the cloning, not before it, that would happen if we just call
dbuf_undirty().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15665
2023-12-15 20:51:41 +03:00
	dmu_buf_fill_done(&db->db, tx, B_FALSE);
}

void
dbuf_destroy(dmu_buf_impl_t *db)
{
	dnode_t *dn;
	dmu_buf_impl_t *parent = db->db_parent;
	dmu_buf_impl_t *dndb;

	ASSERT(MUTEX_HELD(&db->db_mtx));
	ASSERT(zfs_refcount_is_zero(&db->db_holds));

	if (db->db_buf != NULL) {
		arc_buf_destroy(db->db_buf, db);
		db->db_buf = NULL;
	}

	if (db->db_blkid == DMU_BONUS_BLKID) {
		int slots = DB_DNODE(db)->dn_num_slots;
		int bonuslen = DN_SLOTS_TO_BONUSLEN(slots);
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
		if (db->db.db_data != NULL) {
			kmem_free(db->db.db_data, bonuslen);
			arc_space_return(bonuslen, ARC_SPACE_BONUS);
			db->db_state = DB_UNCACHED;
			DTRACE_SET_STATE(db, "buffer cleared");
		}
	}

	dbuf_clear_data(db);

	if (multilist_link_active(&db->db_cache_link)) {
		ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
		    db->db_caching_status == DB_DBUF_METADATA_CACHE);

		multilist_remove(&dbuf_caches[db->db_caching_status].cache, db);
		(void) zfs_refcount_remove_many(
		    &dbuf_caches[db->db_caching_status].size,
		    db->db.db_size, db);

		if (db->db_caching_status == DB_DBUF_METADATA_CACHE) {
			DBUF_STAT_BUMPDOWN(metadata_cache_count);
		} else {
			DBUF_STAT_BUMPDOWN(cache_levels[db->db_level]);
			DBUF_STAT_BUMPDOWN(cache_count);
			DBUF_STAT_DECR(cache_levels_bytes[db->db_level],
			    db->db.db_size);
		}
		db->db_caching_status = DB_NO_CACHE;
	}

	ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
	ASSERT(db->db_data_pending == NULL);
	ASSERT(list_is_empty(&db->db_dirty_records));

	db->db_state = DB_EVICTING;
	DTRACE_SET_STATE(db, "buffer eviction started");
	db->db_blkptr = NULL;

	/*
	 * Now that db_state is DB_EVICTING, nobody else can find this via
	 * the hash table. We can now drop db_mtx, which allows us to
	 * acquire the dn_dbufs_mtx.
	 */
	mutex_exit(&db->db_mtx);

	DB_DNODE_ENTER(db);
	dn = DB_DNODE(db);
	dndb = dn->dn_dbuf;
	if (db->db_blkid != DMU_BONUS_BLKID) {
		boolean_t needlock = !MUTEX_HELD(&dn->dn_dbufs_mtx);
		if (needlock)
			mutex_enter_nested(&dn->dn_dbufs_mtx,
			    NESTED_SINGLE);
		avl_remove(&dn->dn_dbufs, db);
		membar_producer();
		DB_DNODE_EXIT(db);
		if (needlock)
			mutex_exit(&dn->dn_dbufs_mtx);
		/*
		 * Decrementing the dbuf count means that the hold corresponding
		 * to the removed dbuf is no longer discounted in dnode_move(),
		 * so the dnode cannot be moved until after we release the hold.
		 * The membar_producer() ensures visibility of the decremented
		 * value in dnode_move(), since DB_DNODE_EXIT doesn't actually
		 * release any lock.
		 */
		mutex_enter(&dn->dn_mtx);
		dnode_rele_and_unlock(dn, db, B_TRUE);
		db->db_dnode_handle = NULL;

		dbuf_hash_remove(db);
	} else {
		DB_DNODE_EXIT(db);
	}

	ASSERT(zfs_refcount_is_zero(&db->db_holds));

	db->db_parent = NULL;

	ASSERT(db->db_buf == NULL);
	ASSERT(db->db.db_data == NULL);
	ASSERT(db->db_hash_next == NULL);
	ASSERT(db->db_blkptr == NULL);
	ASSERT(db->db_data_pending == NULL);
	ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
	ASSERT(!multilist_link_active(&db->db_cache_link));

	/*
	 * If this dbuf is referenced from an indirect dbuf,
	 * decrement the ref count on the indirect dbuf.
	 */
	if (parent && parent != dndb) {
		mutex_enter(&parent->db_mtx);
		dbuf_rele_and_unlock(parent, db, B_TRUE);
	}

	kmem_cache_free(dbuf_kmem_cache, db);
	arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_DBUF);
}

/*
 * Note: While bpp will always be updated if the function returns success,
 * parentp will not be updated if the dnode does not have dn_dbuf filled in;
 * this happens when the dnode is the meta-dnode, or {user|group|project}used
 * object.
 */
__attribute__((always_inline))
static inline int
dbuf_findbp(dnode_t *dn, int level, uint64_t blkid, int fail_sparse,
    dmu_buf_impl_t **parentp, blkptr_t **bpp)
{
	*parentp = NULL;
	*bpp = NULL;

	ASSERT(blkid != DMU_BONUS_BLKID);

	if (blkid == DMU_SPILL_BLKID) {
		mutex_enter(&dn->dn_mtx);
		if (dn->dn_have_spill &&
		    (dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR))
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
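The relationship between the two slot conventions can be sketched in a few lines of C. This is only an illustration under the sizes stated above (512-byte slots, 16384-byte dnode blocks); the `sketch_` helper names and constants are hypothetical and are not part of the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the text above; the names are hypothetical. */
#define	SKETCH_DNODE_SLOT_SIZE	512u	/* one dnode_phys_t slot */
#define	SKETCH_DNODE_BLOCK_SIZE	16384u	/* one dnode block */

/* On-disk "extra slots" count -> in-memory total slot count. */
static inline unsigned
sketch_extra_to_num_slots(uint8_t extra_slots)
{
	return ((unsigned)extra_slots + 1);
}

/* In-memory total slot count -> on-disk "extra slots" count. */
static inline uint8_t
sketch_num_to_extra_slots(unsigned num_slots)
{
	/* A dnode uses at least one slot and never spans dnode blocks. */
	assert(num_slots >= 1 &&
	    num_slots <= SKETCH_DNODE_BLOCK_SIZE / SKETCH_DNODE_SLOT_SIZE);
	return ((uint8_t)(num_slots - 1));
}

/* Physical size in bytes of a dnode with the given extra-slot count. */
static inline unsigned
sketch_dnode_bytes(uint8_t extra_slots)
{
	return (sketch_extra_to_num_slots(extra_slots) *
	    SKETCH_DNODE_SLOT_SIZE);
}
```

With this convention an extra-slot value of 0 describes a 512-byte dnode, which is why zero-filled padding in older on-disk dnodes reads back as the legacy size.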
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
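The two sets of legal sizes described above can be expressed as simple predicates. The following is a rough sketch based only on the limits given in this message; the function names are hypothetical and the real property validation may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sizes the implementation can handle internally, per the text:
 * any multiple of 512 bytes up to 16k.
 */
static bool
sketch_dnsize_ok_internal(uint32_t size)
{
	return (size >= 512 && size <= 16384 && size % 512 == 0);
}

/*
 * Literal sizes accepted for the dnodesize property, per the text:
 * powers of two from 1k to 16k.
 */
static bool
sketch_dnsize_ok_property(uint32_t size)
{
	return (size >= 1024 && size <= 16384 && (size & (size - 1)) == 0);
}
```

For example, 1536 bytes passes the internal check but is rejected as a literal property value, which matches the note that the power-of-2 restriction exists only in the user interface.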
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
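As a rough illustration of this packing, the sketch below places an 8-bit value in the top byte of a 64-bit id and keeps the object number in the low 48 bits. The helper names are hypothetical, and the sketch treats the 8-bit value as opaque, since this message does not say whether the stored value is the slot count itself or an offset from it.

```c
#include <stdint.h>

/* Layout assumed from the text: object id in bits 0..47, slots in 56..63. */
#define	SKETCH_FOID_OBJ_MASK	((UINT64_C(1) << 48) - 1)
#define	SKETCH_FOID_SLOT_SHIFT	56

/* Combine an object number and an 8-bit slot field into one 64-bit id. */
static inline uint64_t
sketch_foid_pack(uint64_t object, uint8_t slot_field)
{
	return ((object & SKETCH_FOID_OBJ_MASK) |
	    ((uint64_t)slot_field << SKETCH_FOID_SLOT_SHIFT));
}

/* Recover the object number (low 48 bits). */
static inline uint64_t
sketch_foid_object(uint64_t foid)
{
	return (foid & SKETCH_FOID_OBJ_MASK);
}

/* Recover the 8-bit slot field (uppermost 8 bits). */
static inline uint8_t
sketch_foid_slot_field(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_FOID_SLOT_SHIFT));
}
```

Because older log records leave the top byte zero, a reader of this layout sees a slot field of 0 for them, so existing records remain decodable.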
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
			*bpp = DN_SPILL_BLKPTR(dn->dn_phys);
		else
			*bpp = NULL;
		dbuf_add_ref(dn->dn_dbuf, NULL);
		*parentp = dn->dn_dbuf;
		mutex_exit(&dn->dn_mtx);
		return (0);
	}

	int nlevels =
	    (dn->dn_phys->dn_nlevels == 0) ? 1 : dn->dn_phys->dn_nlevels;
	int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;

	ASSERT3U(level * epbs, <, 64);
	ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
	/*
	 * This assertion shouldn't trip as long as the max indirect block size
	 * is less than 1M. The reason for this is that up to that point,
	 * the number of levels required to address an entire object with blocks
	 * of size SPA_MINBLOCKSIZE satisfies nlevels * epbs + 1 <= 64. In
	 * other words, if N * epbs + 1 > 64, then if (N-1) * epbs + 1 > 55
	 * (i.e. we can address the entire object), objects will all use at most
	 * N-1 levels and the assertion won't overflow. However, once epbs is
	 * 13, 4 * 13 + 1 = 53, but 5 * 13 + 1 = 66. Then, 4 levels will not be
	 * enough to address an entire object, so objects will have 5 levels,
	 * but then this assertion will overflow.
	 *
	 * All this is to say that if we ever increase DN_MAX_INDBLKSHIFT, we
	 * need to redo this logic to handle overflows.
	 */
	ASSERT(level >= nlevels ||
	    ((nlevels - level - 1) * epbs) +
	    highbit64(dn->dn_phys->dn_nblkptr) <= 64);
	if (level >= nlevels ||
	    blkid >= ((uint64_t)dn->dn_phys->dn_nblkptr <<
	    ((nlevels - level - 1) * epbs)) ||
	    (fail_sparse &&
	    blkid > (dn->dn_phys->dn_maxblkid >> (level * epbs)))) {
		/* the buffer has no parent yet */
		return (SET_ERROR(ENOENT));
	} else if (level < nlevels-1) {
		/* this block is referenced from an indirect block */
		int err;

		err = dbuf_hold_impl(dn, level + 1,
		    blkid >> epbs, fail_sparse, FALSE, NULL, parentp);

		if (err)
			return (err);
		err = dbuf_read(*parentp, NULL,
		    (DB_RF_HAVESTRUCT | DB_RF_NOPREFETCH | DB_RF_CANFAIL));
		if (err) {
			dbuf_rele(*parentp, NULL);
			*parentp = NULL;
			return (err);
		}
		rw_enter(&(*parentp)->db_rwlock, RW_READER);
		*bpp = ((blkptr_t *)(*parentp)->db.db_data) +
		    (blkid & ((1ULL << epbs) - 1));
		if (blkid > (dn->dn_phys->dn_maxblkid >> (level * epbs)))
			ASSERT(BP_IS_HOLE(*bpp));
		rw_exit(&(*parentp)->db_rwlock);
		return (0);
	} else {
		/* the block is referenced from the dnode */
		ASSERT3U(level, ==, nlevels-1);
		ASSERT(dn->dn_phys->dn_nblkptr == 0 ||
		    blkid < dn->dn_phys->dn_nblkptr);
		if (dn->dn_dbuf) {
			dbuf_add_ref(dn->dn_dbuf, NULL);
			*parentp = dn->dn_dbuf;
		}
		*bpp = &dn->dn_phys->dn_blkptr[blkid];
		return (0);
	}
}

static dmu_buf_impl_t *
dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
    dmu_buf_impl_t *parent, blkptr_t *blkptr, uint64_t hash)
{
	objset_t *os = dn->dn_objset;
	dmu_buf_impl_t *db, *odb;

	ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
	ASSERT(dn->dn_type != DMU_OT_NONE);

	db = kmem_cache_alloc(dbuf_kmem_cache, KM_SLEEP);

	list_create(&db->db_dirty_records, sizeof (dbuf_dirty_record_t),
	    offsetof(dbuf_dirty_record_t, dr_dbuf_node));

	db->db_objset = os;
	db->db.db_object = dn->dn_object;
	db->db_level = level;
	db->db_blkid = blkid;
	db->db_dirtycnt = 0;
	db->db_dnode_handle = dn->dn_handle;
	db->db_parent = parent;
	db->db_blkptr = blkptr;
	db->db_hash = hash;

	db->db_user = NULL;
	db->db_user_immediate_evict = FALSE;
	db->db_freed_in_flight = FALSE;
	db->db_pending_evict = FALSE;

	if (blkid == DMU_BONUS_BLKID) {
		ASSERT3P(parent, ==, dn->dn_dbuf);
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
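The relationship between the two slot conventions described above can be sketched as follows. This is only an illustration: the sketch_* names and constants are hypothetical and are not the identifiers used in the ZFS sources; only the 512-byte slot size and the 16384-byte dnode block size come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants taken from the sizes quoted in the text. */
#define	SKETCH_SLOT_SHIFT	9	/* one slot is 512 bytes */
#define	SKETCH_SLOT_SIZE	(1u << SKETCH_SLOT_SHIFT)
#define	SKETCH_DNODE_BLOCK_SIZE	16384u

/*
 * On-disk convention: count only the "extra" slots, so a legacy
 * 512-byte dnode encodes as 0.
 */
static inline uint8_t
sketch_size_to_extra_slots(uint32_t dnode_size)
{
	assert(dnode_size >= SKETCH_SLOT_SIZE);
	assert(dnode_size <= SKETCH_DNODE_BLOCK_SIZE);
	assert(dnode_size % SKETCH_SLOT_SIZE == 0);
	return ((uint8_t)((dnode_size >> SKETCH_SLOT_SHIFT) - 1));
}

/* In-memory convention: total slots, always one more than on disk. */
static inline uint32_t
sketch_extra_to_num_slots(uint8_t extra_slots)
{
	return ((uint32_t)extra_slots + 1);
}

/* Total bytes consumed by a dnode occupying num_slots slots. */
static inline uint32_t
sketch_num_slots_to_size(uint32_t num_slots)
{
	return (num_slots << SKETCH_SLOT_SHIFT);
}
```

Under these assumptions a 1k dnode has one extra slot on disk and two slots in memory, and the largest dnode (32 slots) needs an on-disk value of 31, which fits in the 8 bits taken from dn_pad2.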
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
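The slot-aware iteration described in the last item can be sketched roughly as below. This is a simplified model, not the real dnode_next_offset_level(): sketch_slot_t stands in for dnode_phys_t, a type of 0 is assumed to mean a free slot, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 512-byte on-disk dnode slot. */
typedef struct sketch_slot {
	uint8_t	ss_type;	/* 0 means the slot is free (assumption) */
	uint8_t	ss_extra_slots;	/* extra slots consumed by this dnode */
} sketch_slot_t;

/*
 * Count the allocated dnodes in one dnode block. The index advances by
 * the current dnode's total slot count rather than by 1, so the interior
 * slots of a large dnode are never interpreted as dnode headers.
 */
static size_t
sketch_count_dnodes(const sketch_slot_t *blk, size_t nslots)
{
	size_t count = 0;
	size_t i = 0;

	while (i < nslots) {
		if (blk[i].ss_type == 0) {
			i++;		/* free slot: step by one */
			continue;
		}
		size_t span = (size_t)blk[i].ss_extra_slots + 1;
		if (i + span > nslots)
			break;		/* would cross the block: malformed */
		count++;
		i += span;		/* skip this dnode's extra slots */
	}
	return (count);
}
```

With simple array indexing, an interior slot holding arbitrary bonus data could be mistaken for a dnode; stepping by the slot count avoids that.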
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
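A minimal sketch of this kind of packing is shown below. The helper names and the exact encoding are assumptions for illustration: the text only states that the uppermost 8 bits carry the slot count and that object ids fit in 48 bits, so this sketch stores the count verbatim, whereas the real macros may bias the stored value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout constants derived from the text. */
#define	SKETCH_OBJ_BITS		48
#define	SKETCH_OBJ_MASK		((UINT64_C(1) << SKETCH_OBJ_BITS) - 1)
#define	SKETCH_SLOTS_SHIFT	56	/* uppermost 8 bits of 64 */

/* Combine an object id and a slot count into one 64-bit field. */
static inline uint64_t
sketch_foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~SKETCH_OBJ_MASK) == 0);	/* id must fit in 48 bits */
	return (obj | ((uint64_t)slots << SKETCH_SLOTS_SHIFT));
}

/* Recover the object id, discarding the upper bits. */
static inline uint64_t
sketch_foid_obj(uint64_t foid)
{
	return (foid & SKETCH_OBJ_MASK);
}

/* Recover the slot count from the uppermost 8 bits. */
static inline uint8_t
sketch_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_SLOTS_SHIFT));
}
```

Because the id is masked on read, a record written by older software (upper bits zero) still yields the same object id.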
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
db->db.db_size = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots) -
|
2008-11-20 23:01:55 +03:00
|
|
|
(dn->dn_nblkptr-1) * sizeof (blkptr_t);
|
|
|
|
ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
|
2010-05-29 00:45:14 +04:00
|
|
|
db->db.db_offset = DMU_BONUS_BLKID;
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_state = DB_UNCACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "bonus buffer created");
|
2018-07-10 20:49:50 +03:00
|
|
|
db->db_caching_status = DB_NO_CACHE;
|
2008-11-20 23:01:55 +03:00
|
|
|
/* the bonus dbuf is not placed in the hash table */
|
2016-07-13 15:42:40 +03:00
|
|
|
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_DBUF);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (db);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else if (blkid == DMU_SPILL_BLKID) {
|
|
|
|
db->db.db_size = (blkptr != NULL) ?
|
|
|
|
BP_GET_LSIZE(blkptr) : SPA_MINBLOCKSIZE;
|
|
|
|
db->db.db_offset = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
int blocksize =
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
db->db_level ? 1 << dn->dn_indblkshift : dn->dn_datablksz;
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db.db_size = blocksize;
|
|
|
|
db->db.db_offset = db->db_blkid * blocksize;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Hold the dn_dbufs_mtx while we get the new dbuf
|
|
|
|
* in the hash table *and* added to the dbufs list.
|
|
|
|
* This prevents a possible deadlock with someone
|
2019-09-03 03:56:41 +03:00
|
|
|
* trying to look up this dbuf before it's added to the
|
2008-11-20 23:01:55 +03:00
|
|
|
* dn_dbufs list.
|
|
|
|
*/
|
|
|
|
mutex_enter(&dn->dn_dbufs_mtx);
|
2020-02-18 22:21:37 +03:00
|
|
|
db->db_state = DB_EVICTING; /* not worth logging this state change */
|
2008-11-20 23:01:55 +03:00
|
|
|
if ((odb = dbuf_hash_insert(db)) != NULL) {
|
|
|
|
/* someone else inserted it first */
|
|
|
|
mutex_exit(&dn->dn_dbufs_mtx);
|
2021-07-01 18:30:31 +03:00
|
|
|
kmem_cache_free(dbuf_kmem_cache, db);
|
2018-01-29 21:24:52 +03:00
|
|
|
DBUF_STAT_BUMP(hash_insert_race);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (odb);
|
|
|
|
}
|
2015-04-03 06:14:28 +03:00
|
|
|
avl_add(&dn->dn_dbufs, db);
|
2017-01-27 02:15:48 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_state = DB_UNCACHED;
|
2020-02-18 22:21:37 +03:00
|
|
|
DTRACE_SET_STATE(db, "regular buffer created");
|
2018-07-10 20:49:50 +03:00
|
|
|
db->db_caching_status = DB_NO_CACHE;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&dn->dn_dbufs_mtx);
|
2016-07-13 15:42:40 +03:00
|
|
|
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_DBUF);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (parent && parent != dn->dn_dbuf)
|
|
|
|
dbuf_add_ref(parent, db);
|
|
|
|
|
|
|
|
ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_count(&dn->dn_holds) > 0);
|
2018-09-26 20:29:26 +03:00
|
|
|
(void) zfs_refcount_add(&dn->dn_holds, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dprintf_dbuf(db, "db=%p\n", db);
|
|
|
|
|
|
|
|
return (db);
|
|
|
|
}
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
/*
|
|
|
|
* This function returns a block pointer and information about the object,
|
|
|
|
* given a dnode and a block. This is a publicly accessible version of
|
|
|
|
* dbuf_findbp that only returns some information, rather than the
|
|
|
|
* dbuf. Note that the dnode passed in must be held, and the dn_struct_rwlock
|
|
|
|
* should be locked as (at least) a reader.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
dbuf_dnode_findbp(dnode_t *dn, uint64_t level, uint64_t blkid,
|
|
|
|
blkptr_t *bp, uint16_t *datablkszsec, uint8_t *indblkshift)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *dbp = NULL;
|
|
|
|
blkptr_t *bp2;
|
|
|
|
int err = 0;
|
|
|
|
ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
|
|
|
|
|
|
|
|
err = dbuf_findbp(dn, level, blkid, B_FALSE, &dbp, &bp2);
|
|
|
|
if (err == 0) {
|
2023-03-05 00:11:49 +03:00
|
|
|
ASSERT3P(bp2, !=, NULL);
|
2019-06-19 19:48:13 +03:00
|
|
|
*bp = *bp2;
|
|
|
|
if (dbp != NULL)
|
|
|
|
dbuf_rele(dbp, NULL);
|
|
|
|
if (datablkszsec != NULL)
|
|
|
|
*datablkszsec = dn->dn_phys->dn_datablkszsec;
|
|
|
|
if (indblkshift != NULL)
|
|
|
|
*indblkshift = dn->dn_phys->dn_indblkshift;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2015-12-22 04:31:57 +03:00
|
|
|
typedef struct dbuf_prefetch_arg {
|
|
|
|
spa_t *dpa_spa; /* The spa to issue the prefetch in. */
|
|
|
|
zbookmark_phys_t dpa_zb; /* The target block to prefetch. */
|
|
|
|
int dpa_epbs; /* Entries (blkptr_t's) Per Block Shift. */
|
|
|
|
int dpa_curlevel; /* The current level that we're reading */
|
2016-06-02 07:04:53 +03:00
|
|
|
dnode_t *dpa_dnode; /* The dnode associated with the prefetch */
|
2015-12-22 04:31:57 +03:00
|
|
|
zio_priority_t dpa_prio; /* The priority I/Os should be issued at. */
|
|
|
|
zio_t *dpa_zio; /* The parent zio_t for all prefetches. */
|
|
|
|
arc_flags_t dpa_aflags; /* Flags to pass to the final prefetch. */
|
2020-09-28 03:08:38 +03:00
|
|
|
dbuf_prefetch_fn dpa_cb; /* prefetch completion callback */
|
|
|
|
void *dpa_arg; /* prefetch completion arg */
|
2015-12-22 04:31:57 +03:00
|
|
|
} dbuf_prefetch_arg_t;
|
|
|
|
|
2020-09-28 03:08:38 +03:00
|
|
|
static void
|
|
|
|
dbuf_prefetch_fini(dbuf_prefetch_arg_t *dpa, boolean_t io_done)
|
|
|
|
{
|
2022-05-25 20:12:52 +03:00
|
|
|
if (dpa->dpa_cb != NULL) {
|
|
|
|
dpa->dpa_cb(dpa->dpa_arg, dpa->dpa_zb.zb_level,
|
|
|
|
dpa->dpa_zb.zb_blkid, io_done);
|
|
|
|
}
|
2020-09-28 03:08:38 +03:00
|
|
|
kmem_free(dpa, sizeof (*dpa));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_issue_final_prefetch_done(zio_t *zio, const zbookmark_phys_t *zb,
|
|
|
|
const blkptr_t *iobp, arc_buf_t *abuf, void *private)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zio, (void) zb, (void) iobp;
|
2020-09-28 03:08:38 +03:00
|
|
|
dbuf_prefetch_arg_t *dpa = private;
|
|
|
|
|
|
|
|
if (abuf != NULL)
|
|
|
|
arc_buf_destroy(abuf, private);
|
2022-06-21 00:32:03 +03:00
|
|
|
|
|
|
|
dbuf_prefetch_fini(dpa, B_TRUE);
|
2020-09-28 03:08:38 +03:00
|
|
|
}
|
|
|
|
|
2015-12-22 04:31:57 +03:00
|
|
|
/*
|
|
|
|
* Actually issue the prefetch read for the block given.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_issue_final_prefetch(dbuf_prefetch_arg_t *dpa, blkptr_t *bp)
|
|
|
|
{
|
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(!BP_IS_REDACTED(bp) ||
|
|
|
|
dsl_dataset_feature_is_active(
|
|
|
|
dpa->dpa_dnode->dn_objset->os_dsl_dataset,
|
|
|
|
SPA_FEATURE_REDACTED_DATASETS));
|
|
|
|
|
|
|
|
if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp) || BP_IS_REDACTED(bp))
|
2020-09-28 03:08:38 +03:00
|
|
|
return (dbuf_prefetch_fini(dpa, B_FALSE));
|
2015-12-22 04:31:57 +03:00
|
|
|
|
2018-03-31 21:11:48 +03:00
|
|
|
int zio_flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE;
|
2017-11-04 23:25:13 +03:00
|
|
|
arc_flags_t aflags =
|
2020-12-10 02:05:06 +03:00
|
|
|
dpa->dpa_aflags | ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH |
|
|
|
|
ARC_FLAG_NO_BUF;
|
2015-12-22 04:31:57 +03:00
|
|
|
|
2018-03-31 21:11:48 +03:00
|
|
|
/* dnodes are always read as raw and then converted later */
|
|
|
|
if (BP_GET_TYPE(bp) == DMU_OT_DNODE && BP_IS_PROTECTED(bp) &&
|
|
|
|
dpa->dpa_curlevel == 0)
|
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
|
|
|
|
2015-12-22 04:31:57 +03:00
|
|
|
ASSERT3U(dpa->dpa_curlevel, ==, BP_GET_LEVEL(bp));
|
|
|
|
ASSERT3U(dpa->dpa_curlevel, ==, dpa->dpa_zb.zb_level);
|
|
|
|
ASSERT(dpa->dpa_zio != NULL);
|
2020-09-28 03:08:38 +03:00
|
|
|
(void) arc_read(dpa->dpa_zio, dpa->dpa_spa, bp,
|
|
|
|
dbuf_issue_final_prefetch_done, dpa,
|
2018-03-31 21:11:48 +03:00
|
|
|
dpa->dpa_prio, zio_flags, &aflags, &dpa->dpa_zb);
|
2015-12-22 04:31:57 +03:00
|
|
|
}

/*
 * Called when an indirect block above our prefetch target is read in. This
 * will either read in the next indirect block down the tree or issue the actual
 * prefetch if the next block down is our target.
 */
static void
dbuf_prefetch_indirect_done(zio_t *zio, const zbookmark_phys_t *zb,
    const blkptr_t *iobp, arc_buf_t *abuf, void *private)
{
	(void) zb, (void) iobp;
	dbuf_prefetch_arg_t *dpa = private;

	ASSERT3S(dpa->dpa_zb.zb_level, <, dpa->dpa_curlevel);
	ASSERT3S(dpa->dpa_curlevel, >, 0);

	if (abuf == NULL) {
		ASSERT(zio == NULL || zio->io_error != 0);
		dbuf_prefetch_fini(dpa, B_TRUE);
		return;
	}
	ASSERT(zio == NULL || zio->io_error == 0);

	/*
	 * The dpa_dnode is only valid if we are called with a NULL
	 * zio. This indicates that the arc_read() returned without
	 * first calling zio_read() to issue a physical read. Once
	 * a physical read is made the dpa_dnode must be invalidated
	 * as the locks guarding it may have been dropped. If the
	 * dpa_dnode is still valid, then we want to add it to the dbuf
	 * cache. To do so, we must hold the dbuf associated with the block
	 * we just prefetched, read its contents so that we associate it
	 * with an arc_buf_t, and then release it.
	 */
	if (zio != NULL) {
		ASSERT3S(BP_GET_LEVEL(zio->io_bp), ==, dpa->dpa_curlevel);
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
		if (zio->io_flags & ZIO_FLAG_RAW_COMPRESS) {
			ASSERT3U(BP_GET_PSIZE(zio->io_bp), ==, zio->io_size);
		} else {
			ASSERT3U(BP_GET_LSIZE(zio->io_bp), ==, zio->io_size);
		}
		ASSERT3P(zio->io_spa, ==, dpa->dpa_spa);

		dpa->dpa_dnode = NULL;
	} else if (dpa->dpa_dnode != NULL) {
		uint64_t curblkid = dpa->dpa_zb.zb_blkid >>
		    (dpa->dpa_epbs * (dpa->dpa_curlevel -
		    dpa->dpa_zb.zb_level));
		dmu_buf_impl_t *db = dbuf_hold_level(dpa->dpa_dnode,
		    dpa->dpa_curlevel, curblkid, FTAG);
		if (db == NULL) {
			arc_buf_destroy(abuf, private);
			dbuf_prefetch_fini(dpa, B_TRUE);
			return;
		}
		(void) dbuf_read(db, NULL,
		    DB_RF_MUST_SUCCEED | DB_RF_NOPREFETCH | DB_RF_HAVESTRUCT);
		dbuf_rele(db, FTAG);
	}

	dpa->dpa_curlevel--;
	uint64_t nextblkid = dpa->dpa_zb.zb_blkid >>
	    (dpa->dpa_epbs * (dpa->dpa_curlevel - dpa->dpa_zb.zb_level));
	blkptr_t *bp = ((blkptr_t *)abuf->b_data) +
	    P2PHASE(nextblkid, 1ULL << dpa->dpa_epbs);

	ASSERT(!BP_IS_REDACTED(bp) || (dpa->dpa_dnode &&
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
	    dsl_dataset_feature_is_active(
	    dpa->dpa_dnode->dn_objset->os_dsl_dataset,
	    SPA_FEATURE_REDACTED_DATASETS)));
	if (BP_IS_HOLE(bp) || BP_IS_REDACTED(bp)) {
		arc_buf_destroy(abuf, private);
		dbuf_prefetch_fini(dpa, B_TRUE);
		return;
	} else if (dpa->dpa_curlevel == dpa->dpa_zb.zb_level) {
		ASSERT3U(nextblkid, ==, dpa->dpa_zb.zb_blkid);
		dbuf_issue_final_prefetch(dpa, bp);
	} else {
		arc_flags_t iter_aflags = ARC_FLAG_NOWAIT;
		zbookmark_phys_t zb;

		/* flag if L2ARC eligible, l2arc_noprefetch then decides */
		if (dpa->dpa_aflags & ARC_FLAG_L2CACHE)
			iter_aflags |= ARC_FLAG_L2CACHE;

		ASSERT3U(dpa->dpa_curlevel, ==, BP_GET_LEVEL(bp));

		SET_BOOKMARK(&zb, dpa->dpa_zb.zb_objset,
		    dpa->dpa_zb.zb_object, dpa->dpa_curlevel, nextblkid);

		(void) arc_read(dpa->dpa_zio, dpa->dpa_spa,
		    bp, dbuf_prefetch_indirect_done, dpa,
		    ZIO_PRIORITY_SYNC_READ,
		    ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
		    &iter_aflags, &zb);
	}

	arc_buf_destroy(abuf, private);
}
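
The block-id arithmetic used by dbuf_prefetch_indirect_done() can be checked in isolation. The following is a minimal standalone sketch, not part of dbuf.c: the helper names `ancestor_blkid` and `slot_in_parent` are hypothetical, `P2PHASE_SKETCH` is a local stand-in for the real `P2PHASE` macro, and `epbs` is assumed to be the log2 of the number of block pointers held by one indirect block.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for P2PHASE(x, align): x modulo a power-of-two align. */
#define	P2PHASE_SKETCH(x, align)	((x) & ((align) - 1))

/*
 * Hypothetical helper: id of the block at `curlevel` that covers the
 * target block `blkid`, which lives at `level`. Each step up the tree
 * divides the id by 2^epbs, which is the shift applied to zb_blkid when
 * curblkid and nextblkid are computed.
 */
static uint64_t
ancestor_blkid(uint64_t blkid, int epbs, int curlevel, int level)
{
	assert(curlevel >= level);
	assert(epbs * (curlevel - level) < 64);
	return (blkid >> (epbs * (curlevel - level)));
}

/*
 * Hypothetical helper: index of the block pointer for `child_blkid`
 * within the array stored in its parent indirect block.
 */
static uint64_t
slot_in_parent(uint64_t child_blkid, int epbs)
{
	return (P2PHASE_SKETCH(child_blkid, 1ULL << epbs));
}
```

With epbs = 7 (128 pointers per indirect block), level-0 block 1000 is covered by level-1 block 7 and occupies slot 104 of that block.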

/*
 * Issue prefetch reads for the given block on the given level. If the indirect
 * blocks above that block are not in memory, we will read them in
 * asynchronously. As a result, this call never blocks waiting for a read to
 * complete. Note that the prefetch might fail if the dataset is encrypted and
 * the encryption key is unmapped before the IO completes.
 */
int
dbuf_prefetch_impl(dnode_t *dn, int64_t level, uint64_t blkid,
    zio_priority_t prio, arc_flags_t aflags, dbuf_prefetch_fn cb,
    void *arg)
{
	blkptr_t bp;
	int epbs, nlevels, curlevel;
	uint64_t curblkid;

	ASSERT(blkid != DMU_BONUS_BLKID);
	ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));

	if (blkid > dn->dn_maxblkid)
		goto no_issue;

	if (level == 0 && dnode_block_freed(dn, blkid))
		goto no_issue;

	/*
	 * This dnode hasn't been written to disk yet, so there's nothing to
	 * prefetch.
	 */
	nlevels = dn->dn_phys->dn_nlevels;
	if (level >= nlevels || dn->dn_phys->dn_nblkptr == 0)
		goto no_issue;

	epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
	if (dn->dn_phys->dn_maxblkid < blkid << (epbs * level))
		goto no_issue;

	dmu_buf_impl_t *db = dbuf_find(dn->dn_objset, dn->dn_object,
	    level, blkid, NULL);
	if (db != NULL) {
		mutex_exit(&db->db_mtx);
		/*
		 * This dbuf already exists. It is either CACHED, or
		 * (we assume) about to be read or filled.
		 */
		goto no_issue;
	}

	/*
	 * Find the closest ancestor (indirect block) of the target block
	 * that is present in the cache. In this indirect block, we will
	 * find the bp that is at curlevel, curblkid.
	 */
	curlevel = level;
	curblkid = blkid;
	while (curlevel < nlevels - 1) {
		int parent_level = curlevel + 1;
		uint64_t parent_blkid = curblkid >> epbs;
		dmu_buf_impl_t *db;

		if (dbuf_hold_impl(dn, parent_level, parent_blkid,
		    FALSE, TRUE, FTAG, &db) == 0) {
			blkptr_t *bpp = db->db_buf->b_data;
			bp = bpp[P2PHASE(curblkid, 1 << epbs)];
			dbuf_rele(db, FTAG);
			break;
		}

		curlevel = parent_level;
		curblkid = parent_blkid;
	}

	if (curlevel == nlevels - 1) {
		/* No cached indirect blocks found. */
		ASSERT3U(curblkid, <, dn->dn_phys->dn_nblkptr);
		bp = dn->dn_phys->dn_blkptr[curblkid];
	}
	ASSERT(!BP_IS_REDACTED(&bp) ||
	    dsl_dataset_feature_is_active(dn->dn_objset->os_dsl_dataset,
	    SPA_FEATURE_REDACTED_DATASETS));
	if (BP_IS_HOLE(&bp) || BP_IS_REDACTED(&bp))
		goto no_issue;

	ASSERT3U(curlevel, ==, BP_GET_LEVEL(&bp));

	zio_t *pio = zio_root(dmu_objset_spa(dn->dn_objset), NULL, NULL,
	    ZIO_FLAG_CANFAIL);

	dbuf_prefetch_arg_t *dpa = kmem_zalloc(sizeof (*dpa), KM_SLEEP);
	dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
	SET_BOOKMARK(&dpa->dpa_zb, ds != NULL ? ds->ds_object : DMU_META_OBJSET,
	    dn->dn_object, level, blkid);
	dpa->dpa_curlevel = curlevel;
	dpa->dpa_prio = prio;
	dpa->dpa_aflags = aflags;
	dpa->dpa_spa = dn->dn_objset->os_spa;
	dpa->dpa_dnode = dn;
	dpa->dpa_epbs = epbs;
	dpa->dpa_zio = pio;
	dpa->dpa_cb = cb;
	dpa->dpa_arg = arg;

Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in the dbuf cache. If a following
read reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
	if (!DNODE_LEVEL_IS_CACHEABLE(dn, level))
		dpa->dpa_aflags |= ARC_FLAG_UNCACHED;
	else if (dnode_level_is_l2cacheable(&bp, dn, level))
		dpa->dpa_aflags |= ARC_FLAG_L2CACHE;

	/*
	 * If we have the indirect just above us, no need to do the asynchronous
	 * prefetch chain; we'll just run the last step ourselves. If we're at
	 * a higher level, though, we want to issue the prefetches for all the
	 * indirect blocks asynchronously, so we can go on with whatever we were
	 * doing.
	 */
	if (curlevel == level) {
		ASSERT3U(curblkid, ==, blkid);
		dbuf_issue_final_prefetch(dpa, &bp);
	} else {
		arc_flags_t iter_aflags = ARC_FLAG_NOWAIT;
		zbookmark_phys_t zb;

		/* flag if L2ARC eligible, l2arc_noprefetch then decides */
		if (dnode_level_is_l2cacheable(&bp, dn, level))
			iter_aflags |= ARC_FLAG_L2CACHE;

		SET_BOOKMARK(&zb, ds != NULL ? ds->ds_object : DMU_META_OBJSET,
		    dn->dn_object, curlevel, curblkid);
		(void) arc_read(dpa->dpa_zio, dpa->dpa_spa,
		    &bp, dbuf_prefetch_indirect_done, dpa,
		    ZIO_PRIORITY_SYNC_READ,
		    ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
		    &iter_aflags, &zb);
	}
	/*
	 * We use pio here instead of dpa_zio since it's possible that
	 * dpa may have already been freed.
	 */
	zio_nowait(pio);
	return (1);
no_issue:
	if (cb != NULL)
		cb(arg, level, blkid, B_FALSE);
	return (0);
}
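
The "closest cached ancestor" search in dbuf_prefetch_impl() can be illustrated without any ZFS state. The sketch below is a simplified model under stated assumptions: `find_start_block`, `is_cached_fn` and `cached_at_or_above` are hypothetical names, and the cache lookup that dbuf_hold_impl() performs with fail_uncached set is replaced by a caller-supplied predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: is block (level, blkid) resident in the cache? */
typedef bool (*is_cached_fn)(int level, uint64_t blkid, void *arg);

/*
 * Climb from (level, blkid) until the parent of the current block is
 * cached or the top level is reached. On return, *curlevel and *curblkid
 * name the block whose block pointer has to be read first.
 */
static void
find_start_block(int level, uint64_t blkid, int nlevels, int epbs,
    is_cached_fn is_cached, void *arg, int *curlevel, uint64_t *curblkid)
{
	assert(level < nlevels);
	*curlevel = level;
	*curblkid = blkid;
	while (*curlevel < nlevels - 1) {
		int parent_level = *curlevel + 1;
		uint64_t parent_blkid = *curblkid >> epbs;

		/* A cached parent already holds the bp we need. */
		if (is_cached(parent_level, parent_blkid, arg))
			break;
		*curlevel = parent_level;
		*curblkid = parent_blkid;
	}
}

/* Example predicate: every block at or above the level in *arg is cached. */
static bool
cached_at_or_above(int level, uint64_t blkid, void *arg)
{
	(void) blkid;
	return (level >= *(int *)arg);
}
```

When the loop ends with `*curlevel == nlevels - 1`, no indirect block was cached and the starting pointer comes from the dnode itself, which is the case handled by the "No cached indirect blocks found" branch.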

int
dbuf_prefetch(dnode_t *dn, int64_t level, uint64_t blkid, zio_priority_t prio,
    arc_flags_t aflags)
{

	return (dbuf_prefetch_impl(dn, level, blkid, prio, aflags, NULL, NULL));
}

/*
 * Helper function for dbuf_hold_impl() to copy a buffer. Handles
 * the case of encrypted, compressed and uncompressed buffers by
 * allocating the new buffer, respectively, with arc_alloc_raw_buf(),
 * arc_alloc_compressed_buf() or arc_alloc_buf().
 *
 * NOTE: Declared noinline to avoid stack bloat in dbuf_hold_impl().
 */
noinline static void
dbuf_hold_copy(dnode_t *dn, dmu_buf_impl_t *db)
{
	dbuf_dirty_record_t *dr = db->db_data_pending;
	arc_buf_t *data = dr->dt.dl.dr_data;
	enum zio_compress compress_type = arc_get_compression(data);
	uint8_t complevel = arc_get_complevel(data);

	if (arc_is_encrypted(data)) {
		boolean_t byteorder;
		uint8_t salt[ZIO_DATA_SALT_LEN];
		uint8_t iv[ZIO_DATA_IV_LEN];
		uint8_t mac[ZIO_DATA_MAC_LEN];

		arc_get_raw_params(data, &byteorder, salt, iv, mac);
		dbuf_set_data(db, arc_alloc_raw_buf(dn->dn_objset->os_spa, db,
		    dmu_objset_id(dn->dn_objset), byteorder, salt, iv, mac,
		    dn->dn_type, arc_buf_size(data), arc_buf_lsize(data),
		    compress_type, complevel));
	} else if (compress_type != ZIO_COMPRESS_OFF) {
		dbuf_set_data(db, arc_alloc_compressed_buf(
		    dn->dn_objset->os_spa, db, arc_buf_size(data),
		    arc_buf_lsize(data), compress_type, complevel));
	} else {
		dbuf_set_data(db, arc_alloc_buf(dn->dn_objset->os_spa, db,
		    DBUF_GET_BUFC_TYPE(db), db->db.db_size));
	}

	rw_enter(&db->db_rwlock, RW_WRITER);
	memcpy(db->db.db_data, data->b_data, arc_buf_size(data));
	rw_exit(&db->db_rwlock);
}
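
The three-way allocator choice made by dbuf_hold_copy() reduces to a small decision function. This is an illustrative sketch only: `copy_alloc_t` and `pick_copy_alloc` are hypothetical names, and the two booleans stand in for `arc_is_encrypted(data)` and `compress_type != ZIO_COMPRESS_OFF`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three allocators named in the comment. */
typedef enum {
	COPY_ALLOC_RAW,		/* stands for arc_alloc_raw_buf() */
	COPY_ALLOC_COMPRESSED,	/* stands for arc_alloc_compressed_buf() */
	COPY_ALLOC_PLAIN	/* stands for arc_alloc_buf() */
} copy_alloc_t;

/*
 * Encryption is tested first, so an encrypted and compressed buffer
 * takes the raw path, which also carries the compression parameters.
 */
static copy_alloc_t
pick_copy_alloc(bool encrypted, bool compressed)
{
	if (encrypted)
		return (COPY_ALLOC_RAW);
	if (compressed)
		return (COPY_ALLOC_COMPRESSED);
	return (COPY_ALLOC_PLAIN);
}
```

The order of the tests matters: checking compression before encryption would send encrypted buffers down a path that has no salt, IV or MAC.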

/*
 * Returns with db_holds incremented, and db_mtx not held.
 * Note: dn_struct_rwlock must be held.
 */
int
dbuf_hold_impl(dnode_t *dn, uint8_t level, uint64_t blkid,
    boolean_t fail_sparse, boolean_t fail_uncached,
    const void *tag, dmu_buf_impl_t **dbp)
{
	dmu_buf_impl_t *db, *parent = NULL;
	uint64_t hv;

	/* If the pool has been created, verify the tx_sync_lock is not held */
	spa_t *spa = dn->dn_objset->os_spa;
	dsl_pool_t *dp = spa->spa_dsl_pool;
	if (dp != NULL) {
		ASSERT(!MUTEX_HELD(&dp->dp_tx.tx_sync_lock));
	}

	ASSERT(blkid != DMU_BONUS_BLKID);
	ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock));
	ASSERT3U(dn->dn_nlevels, >, level);

	*dbp = NULL;

	/* dbuf_find() returns with db_mtx held */
	db = dbuf_find(dn->dn_objset, dn->dn_object, level, blkid, &hv);

	if (db == NULL) {
		blkptr_t *bp = NULL;
		int err;

		if (fail_uncached)
			return (SET_ERROR(ENOENT));

		ASSERT3P(parent, ==, NULL);
		err = dbuf_findbp(dn, level, blkid, fail_sparse, &parent, &bp);
		if (fail_sparse) {
			if (err == 0 && bp && BP_IS_HOLE(bp))
				err = SET_ERROR(ENOENT);
			if (err) {
				if (parent)
					dbuf_rele(parent, NULL);
				return (err);
			}
		}
		if (err && err != ENOENT)
			return (err);
		db = dbuf_create(dn, level, blkid, parent, bp, hv);
	}

	if (fail_uncached && db->db_state != DB_CACHED) {
		mutex_exit(&db->db_mtx);
		return (SET_ERROR(ENOENT));
	}

	if (db->db_buf != NULL) {
		arc_buf_access(db->db_buf);
		ASSERT3P(db->db.db_data, ==, db->db_buf->b_data);
	}

	ASSERT(db->db_buf == NULL || arc_referenced(db->db_buf));

	/*
	 * If this buffer is currently syncing out, and we are
	 * still referencing it from db_data, we need to make a copy
	 * of it in case we decide we want to dirty it again in this txg.
	 */
	if (db->db_level == 0 && db->db_blkid != DMU_BONUS_BLKID &&
	    dn->dn_object != DMU_META_DNODE_OBJECT &&
	    db->db_state == DB_CACHED && db->db_data_pending) {
		dbuf_dirty_record_t *dr = db->db_data_pending;
		if (dr->dt.dl.dr_data == db->db_buf) {
			ASSERT3P(db->db_buf, !=, NULL);
			dbuf_hold_copy(dn, db);
		}
	}

	if (multilist_link_active(&db->db_cache_link)) {
		ASSERT(zfs_refcount_is_zero(&db->db_holds));
		ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
		    db->db_caching_status == DB_DBUF_METADATA_CACHE);

		multilist_remove(&dbuf_caches[db->db_caching_status].cache, db);
		(void) zfs_refcount_remove_many(
		    &dbuf_caches[db->db_caching_status].size,
		    db->db.db_size, db);

		if (db->db_caching_status == DB_DBUF_METADATA_CACHE) {
			DBUF_STAT_BUMPDOWN(metadata_cache_count);
		} else {
			DBUF_STAT_BUMPDOWN(cache_levels[db->db_level]);
			DBUF_STAT_BUMPDOWN(cache_count);
			DBUF_STAT_DECR(cache_levels_bytes[db->db_level],
			    db->db.db_size);
		}
		db->db_caching_status = DB_NO_CACHE;
	}
	(void) zfs_refcount_add(&db->db_holds, tag);
	DBUF_VERIFY(db);
	mutex_exit(&db->db_mtx);

	/* NOTE: we can't rele the parent until after we drop the db_mtx */
	if (parent)
		dbuf_rele(parent, NULL);

	ASSERT3P(DB_DNODE(db), ==, dn);
	ASSERT3U(db->db_blkid, ==, blkid);
	ASSERT3U(db->db_level, ==, level);
	*dbp = db;

	return (0);
}

dmu_buf_impl_t *
dbuf_hold(dnode_t *dn, uint64_t blkid, const void *tag)
{
	return (dbuf_hold_level(dn, 0, blkid, tag));
}

dmu_buf_impl_t *
dbuf_hold_level(dnode_t *dn, int level, uint64_t blkid, const void *tag)
{
	dmu_buf_impl_t *db;
	int err = dbuf_hold_impl(dn, level, blkid, FALSE, FALSE, tag, &db);
	return (err ? NULL : db);
}

void
dbuf_create_bonus(dnode_t *dn)
{
	ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));

	ASSERT(dn->dn_bonus == NULL);
	dn->dn_bonus = dbuf_create(dn, 0, DMU_BONUS_BLKID, dn->dn_dbuf, NULL,
	    dbuf_hash(dn->dn_objset, dn->dn_object, 0, DMU_BONUS_BLKID));
}
|
|
|
|
|
|
|
|
int
|
|
|
|
dbuf_spill_set_blksz(dmu_buf_t *db_fake, uint64_t blksz, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid != DMU_SPILL_BLKID)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOTSUP));
|
2010-05-29 00:45:14 +04:00
|
|
|
if (blksz == 0)
|
|
|
|
blksz = SPA_MINBLOCKSIZE;
|
2014-11-03 23:15:08 +03:00
|
|
|
ASSERT3U(blksz, <=, spa_maxblocksize(dmu_objset_spa(db->db_objset)));
|
|
|
|
blksz = P2ROUNDUP(blksz, SPA_MINBLOCKSIZE);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dbuf_new_size(db, blksz, tx);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#pragma weak dmu_buf_add_ref = dbuf_add_ref
|
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dbuf_add_ref(dmu_buf_impl_t *db, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2018-09-26 20:29:26 +03:00
|
|
|
int64_t holds = zfs_refcount_add(&db->db_holds, tag);
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3S(holds, >, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-04-02 14:59:15 +03:00
|
|
|
#pragma weak dmu_buf_try_add_ref = dbuf_try_add_ref
|
|
|
|
boolean_t
|
|
|
|
dbuf_try_add_ref(dmu_buf_t *db_fake, objset_t *os, uint64_t obj, uint64_t blkid,
|
2022-04-19 21:38:30 +03:00
|
|
|
const void *tag)
|
2015-04-02 14:59:15 +03:00
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
dmu_buf_impl_t *found_db;
|
|
|
|
boolean_t result = B_FALSE;
|
|
|
|
|
2015-05-29 02:14:19 +03:00
|
|
|
if (blkid == DMU_BONUS_BLKID)
|
2015-04-02 14:59:15 +03:00
|
|
|
found_db = dbuf_find_bonus(os, obj);
|
|
|
|
else
|
2022-12-14 04:29:21 +03:00
|
|
|
found_db = dbuf_find(os, obj, 0, blkid, NULL);
|
2015-04-02 14:59:15 +03:00
|
|
|
|
|
|
|
if (found_db != NULL) {
|
|
|
|
if (db == found_db && dbuf_refcount(db) > db->db_dirtycnt) {
|
2018-09-26 20:29:26 +03:00
|
|
|
(void) zfs_refcount_add(&db->db_holds, tag);
|
2015-04-02 14:59:15 +03:00
|
|
|
result = B_TRUE;
|
|
|
|
}
|
2015-05-29 02:14:19 +03:00
|
|
|
mutex_exit(&found_db->db_mtx);
|
2015-04-02 14:59:15 +03:00
|
|
|
}
|
|
|
|
return (result);
|
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If you call dbuf_rele() you had better not be referencing the dnode handle
|
|
|
|
* unless you have some other direct or indirect hold on the dnode. (An indirect
|
|
|
|
* hold is a hold on one of the dnode's dbufs, including the bonus buffer.)
|
|
|
|
* Without that, the dbuf_rele() could lead to a dnode_rele() followed by the
|
|
|
|
* dnode's parent dbuf evicting its dnode handles.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dbuf_rele(dmu_buf_impl_t *db, const void *tag)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
mutex_enter(&db->db_mtx);
|
2018-08-01 00:51:15 +03:00
|
|
|
dbuf_rele_and_unlock(db, tag, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_buf_rele(dmu_buf_t *db, const void *tag)
|
2013-12-09 22:37:51 +04:00
|
|
|
{
|
|
|
|
dbuf_rele((dmu_buf_impl_t *)db, tag);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* dbuf_rele() for an already-locked dbuf. This is necessary to allow
|
2018-05-31 20:29:12 +03:00
|
|
|
* db_dirtycnt and db_holds to be updated atomically. The 'evicting'
|
|
|
|
* argument should be set if we are already in the dbuf-evicting code
|
|
|
|
* path, in which case we don't want to recursively evict. This allows us to
|
|
|
|
* avoid deeply nested stacks that would have a call flow similar to this:
|
|
|
|
*
|
|
|
|
* dbuf_rele()-->dbuf_rele_and_unlock()-->dbuf_evict_notify()
|
|
|
|
* ^ |
|
|
|
|
* | |
|
|
|
|
* +-----dbuf_destroy()<--dbuf_evict_one()<--------+
|
|
|
|
*
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dbuf_rele_and_unlock(dmu_buf_impl_t *db, const void *tag, boolean_t evicting)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int64_t holds;
|
2020-02-05 22:08:44 +03:00
|
|
|
uint64_t size;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2008-11-20 23:01:55 +03:00
|
|
|
DBUF_VERIFY(db);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* Remove the reference to the dbuf before removing its hold on the
|
|
|
|
* dnode so we can guarantee in dnode_move() that a referenced bonus
|
|
|
|
* buffer has a corresponding dnode hold.
|
|
|
|
*/
|
2018-10-01 20:42:05 +03:00
|
|
|
holds = zfs_refcount_remove(&db->db_holds, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(holds >= 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't freeze indirects if there is a possibility that they
|
|
|
|
* may be modified in the current syncing context.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (db->db_buf != NULL &&
|
|
|
|
holds == (db->db_level == 0 ? db->db_dirtycnt : 0)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_freeze(db->db_buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (holds == db->db_dirtycnt &&
|
2015-10-14 00:09:45 +03:00
|
|
|
db->db_level == 0 && db->db_user_immediate_evict)
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_evict_user(db);
|
|
|
|
|
|
|
|
if (holds == 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
2015-03-12 03:10:35 +03:00
|
|
|
dnode_t *dn;
|
2015-10-14 00:09:45 +03:00
|
|
|
boolean_t evict_dbuf = db->db_pending_evict;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
/*
|
2015-03-12 03:10:35 +03:00
|
|
|
* If the dnode moves here, we cannot cross this
|
|
|
|
* barrier until the move completes.
|
2010-08-27 01:24:34 +04:00
|
|
|
*/
|
|
|
|
DB_DNODE_ENTER(db);
|
2015-03-12 03:10:35 +03:00
|
|
|
|
|
|
|
dn = DB_DNODE(db);
|
|
|
|
atomic_dec_32(&dn->dn_dbufs_count);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Decrementing the dbuf count means that the bonus
|
|
|
|
* buffer's dnode hold is no longer discounted in
|
|
|
|
* dnode_move(). The dnode cannot move until after
|
2015-10-14 00:09:45 +03:00
|
|
|
* the dnode_rele() below.
|
2015-03-12 03:10:35 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2015-03-12 03:10:35 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not reference db after its lock is dropped.
|
|
|
|
* Another thread may evict it.
|
|
|
|
*/
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
2015-10-14 00:09:45 +03:00
|
|
|
if (evict_dbuf)
|
2015-03-12 03:10:35 +03:00
|
|
|
dnode_evict_bonus(dn);
|
2015-10-14 00:09:45 +03:00
|
|
|
|
|
|
|
dnode_rele(dn, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (db->db_buf == NULL) {
|
|
|
|
/*
|
|
|
|
* This is a special case: we never associated this
|
|
|
|
* dbuf with any data allocated from the ARC.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(db->db_state == DB_UNCACHED ||
|
|
|
|
db->db_state == DB_NOFILL);
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_destroy(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (arc_released(db->db_buf)) {
|
|
|
|
/*
|
|
|
|
* This dbuf has anonymous data associated with it.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
dbuf_destroy(db);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in the dbuf cache. If a subsequent
read reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
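
The retention rule described above can be sketched roughly as follows. This is a simplified illustration rather than the dbuf code itself: `sketch_buf_t`, `sketch_note_read()` and `sketch_keep_in_cache()` are hypothetical names, and the way a read that reaches the end clears the partial-read mark is an assumption drawn from the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for per-buffer state; not the real dmu_buf_impl_t. */
typedef struct sketch_buf {
	uint64_t size;		/* buffer size in bytes */
	bool cacheable;		/* primarycache permits caching this buffer */
	bool partial_read;	/* read from the start, end not yet reached */
	bool pending_evict;	/* eviction has already been requested */
} sketch_buf_t;

/* Record a read of [offset, offset + len) within the buffer. */
static void
sketch_note_read(sketch_buf_t *b, uint64_t offset, uint64_t len)
{
	assert(len <= b->size && offset <= b->size - len);
	if (offset + len == b->size)
		b->partial_read = false;	/* the read reached the end */
	else if (offset == 0)
		b->partial_read = true;		/* started but not finished */
}

/* On the last release, decide whether the buffer stays in the cache. */
static bool
sketch_keep_in_cache(const sketch_buf_t *b)
{
	if (b->pending_evict)
		return (false);
	return (b->cacheable || b->partial_read);
}
```

Under these assumptions, an uncacheable buffer is kept only between a read that starts at offset 0 and a later read that reaches its end.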
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
} else if (!(DBUF_IS_CACHEABLE(db) || db->db_partial_read) ||
|
|
|
|
db->db_pending_evict) {
|
|
|
|
dbuf_destroy(db);
|
|
|
|
} else if (!multilist_link_active(&db->db_cache_link)) {
|
|
|
|
ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
|
|
|
|
|
|
|
|
dbuf_cached_state_t dcs =
|
|
|
|
dbuf_include_in_metadata_cache(db) ?
|
|
|
|
DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
|
|
|
|
db->db_caching_status = dcs;
|
|
|
|
|
|
|
|
multilist_insert(&dbuf_caches[dcs].cache, db);
|
|
|
|
uint64_t db_size = db->db.db_size;
|
|
|
|
size = zfs_refcount_add_many(
|
|
|
|
&dbuf_caches[dcs].size, db_size, db);
|
|
|
|
uint8_t db_level = db->db_level;
|
|
|
|
mutex_exit(&db->db_mtx);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2023-01-05 03:29:54 +03:00
|
|
|
if (dcs == DB_DBUF_METADATA_CACHE) {
|
|
|
|
DBUF_STAT_BUMP(metadata_cache_count);
|
|
|
|
DBUF_STAT_MAX(metadata_cache_size_bytes_max,
|
|
|
|
size);
|
|
|
|
} else {
|
|
|
|
DBUF_STAT_BUMP(cache_count);
|
|
|
|
DBUF_STAT_MAX(cache_size_bytes_max, size);
|
|
|
|
DBUF_STAT_BUMP(cache_levels[db_level]);
|
|
|
|
DBUF_STAT_INCR(cache_levels_bytes[db_level],
|
|
|
|
db_size);
|
2014-07-15 11:43:18 +04:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2023-01-05 03:29:54 +03:00
|
|
|
if (dcs == DB_DBUF_CACHE && !evicting)
|
|
|
|
dbuf_evict_notify(size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#pragma weak dmu_buf_refcount = dbuf_refcount
|
|
|
|
uint64_t
|
|
|
|
dbuf_refcount(dmu_buf_impl_t *db)
|
|
|
|
{
|
2018-10-01 20:42:05 +03:00
|
|
|
return (zfs_refcount_count(&db->db_holds));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-06-19 00:10:54 +03:00
|
|
|
uint64_t
|
|
|
|
dmu_buf_user_refcount(dmu_buf_t *db_fake)
|
|
|
|
{
|
|
|
|
uint64_t holds;
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT3U(zfs_refcount_count(&db->db_holds), >=, db->db_dirtycnt);
|
|
|
|
holds = zfs_refcount_count(&db->db_holds) - db->db_dirtycnt;
|
2018-06-19 00:10:54 +03:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
|
|
|
return (holds);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void *
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_replace_user(dmu_buf_t *db_fake, dmu_buf_user_t *old_user,
|
|
|
|
dmu_buf_user_t *new_user)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
dbuf_verify_user(db, DBVU_NOT_EVICTING);
|
|
|
|
if (db->db_user == old_user)
|
|
|
|
db->db_user = new_user;
|
|
|
|
else
|
|
|
|
old_user = db->db_user;
|
|
|
|
dbuf_verify_user(db, DBVU_NOT_EVICTING);
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
|
|
|
return (old_user);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void *
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_set_user(dmu_buf_t *db_fake, dmu_buf_user_t *user)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-04-02 06:44:32 +03:00
|
|
|
return (dmu_buf_replace_user(db_fake, NULL, user));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void *
|
2015-04-02 06:44:32 +03:00
|
|
|
dmu_buf_set_user_ie(dmu_buf_t *db_fake, dmu_buf_user_t *user)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
2015-10-14 00:09:45 +03:00
|
|
|
db->db_user_immediate_evict = TRUE;
|
2015-04-02 06:44:32 +03:00
|
|
|
return (dmu_buf_set_user(db_fake, user));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
void *
|
|
|
|
dmu_buf_remove_user(dmu_buf_t *db_fake, dmu_buf_user_t *user)
|
|
|
|
{
|
|
|
|
return (dmu_buf_replace_user(db_fake, user, NULL));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void *
|
|
|
|
dmu_buf_get_user(dmu_buf_t *db_fake)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
dbuf_verify_user(db, DBVU_NOT_EVICTING);
|
|
|
|
return (db->db_user);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2022-05-06 21:57:37 +03:00
|
|
|
dmu_buf_user_evict_wait(void)
|
2015-04-02 06:44:32 +03:00
|
|
|
{
|
|
|
|
taskq_wait(dbu_evict_taskq);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
blkptr_t *
|
|
|
|
dmu_buf_get_blkptr(dmu_buf_t *db)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
|
|
|
|
return (dbi->db_blkptr);
|
|
|
|
}
|
|
|
|
|
2016-07-21 01:39:55 +03:00
|
|
|
objset_t *
|
|
|
|
dmu_buf_get_objset(dmu_buf_t *db)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
|
|
|
|
return (dbi->db_objset);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dbuf_check_blkptr(dnode_t *dn, dmu_buf_impl_t *db)
|
|
|
|
{
|
|
|
|
/* ASSERT(dmu_tx_is_syncing(tx)) */
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
|
|
|
|
if (db->db_blkptr != NULL)
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_SPILL_BLKID) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
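
The two slot conventions above can be illustrated with a small sketch. The helper names and `SKETCH_` constants are hypothetical; only the 512-byte slot size, the 16384-byte dnode block size, and the "extra slots plus one" relationship are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_DNODE_SHIFT	9	/* 512-byte slot, per the text */
#define	SKETCH_DNODE_BLOCK_SIZE	16384	/* dnode block size, per the text */

/*
 * On-disk convention: count of slots beyond the first, so 0 means a
 * 512-byte dnode. In-memory convention: total slots, i.e. extra + 1.
 */
static uint32_t
sketch_num_slots(uint8_t extra_slots)
{
	return ((uint32_t)extra_slots + 1);
}

/* Physical size in bytes of a dnode with the given extra-slot count. */
static uint32_t
sketch_dnode_bytes(uint8_t extra_slots)
{
	uint32_t bytes = sketch_num_slots(extra_slots) << SKETCH_DNODE_SHIFT;

	assert(bytes <= SKETCH_DNODE_BLOCK_SIZE);
	return (bytes);
}
```

With this convention an extra-slot count of 0 describes a 512-byte dnode and 31 describes one that fills a whole 16k dnode block.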
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
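
A minimal sketch of packing a slot count above a 48-bit object id in one 64-bit field, as the paragraph above describes. The function names and `SKETCH_` constants are hypothetical, and storing the count unmodified (rather than biased) is an assumption; the actual ZIL encoding may differ.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_OBJ_BITS		48	/* object id width, per the text */
#define	SKETCH_SLOTS_SHIFT	56	/* uppermost 8 bits of a 64-bit field */
#define	SKETCH_OBJ_MASK		((UINT64_C(1) << SKETCH_OBJ_BITS) - 1)

/* Combine an object id and a slot count into one 64-bit value. */
static uint64_t
sketch_foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~SKETCH_OBJ_MASK) == 0);	/* id must fit in 48 bits */
	return (obj | ((uint64_t)slots << SKETCH_SLOTS_SHIFT));
}

/* Extract the object id from the low 48 bits. */
static uint64_t
sketch_foid_obj(uint64_t foid)
{
	return (foid & SKETCH_OBJ_MASK);
}

/* Extract the slot count from the top 8 bits. */
static uint8_t
sketch_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_SLOTS_SHIFT));
}
```

Because the object id never uses the top bits, both values can be recovered independently from the packed field.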
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
db->db_blkptr = DN_SPILL_BLKPTR(dn->dn_phys);
|
2010-05-29 00:45:14 +04:00
|
|
|
BP_ZERO(db->db_blkptr);
|
|
|
|
return;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_level == dn->dn_phys->dn_nlevels-1) {
|
|
|
|
/*
|
|
|
|
* This buffer was allocated at a time when there was
|
|
|
|
* no available blkptrs from the dnode, or it was
|
2019-09-03 03:56:41 +03:00
|
|
|
* inappropriate to hook it in (i.e., nlevels mismatch).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
ASSERT(db->db_blkid < dn->dn_phys->dn_nblkptr);
|
|
|
|
ASSERT(db->db_parent == NULL);
|
|
|
|
db->db_parent = dn->dn_dbuf;
|
|
|
|
db->db_blkptr = &dn->dn_phys->dn_blkptr[db->db_blkid];
|
|
|
|
DBUF_VERIFY(db);
|
|
|
|
} else {
|
|
|
|
dmu_buf_impl_t *parent = db->db_parent;
|
|
|
|
int epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
|
|
|
|
|
|
|
|
ASSERT(dn->dn_phys->dn_nlevels > 1);
|
|
|
|
if (parent == NULL) {
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
rw_enter(&dn->dn_struct_rwlock, RW_READER);
|
2015-12-22 04:31:57 +03:00
|
|
|
parent = dbuf_hold_level(dn, db->db_level + 1,
|
|
|
|
db->db_blkid >> epbs, db);
|
2008-11-20 23:01:55 +03:00
|
|
|
rw_exit(&dn->dn_struct_rwlock);
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
db->db_parent = parent;
|
|
|
|
}
|
|
|
|
db->db_blkptr = (blkptr_t *)parent->db.db_data +
|
|
|
|
(db->db_blkid & ((1ULL << epbs) - 1));
|
|
|
|
DBUF_VERIFY(db);
|
|
|
|
}
|
|
|
|
}
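
The index arithmetic in the else branch above can be tried in isolation. This is a standalone sketch with hypothetical helper names; it assumes a 128-byte block pointer (a shift of 7), so an indirect block of 2^indblkshift bytes holds 2^(indblkshift - 7) pointers.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_BLKPTRSHIFT	7	/* assumed: 128-byte block pointer */

/* Block id of the parent indirect block one level up. */
static uint64_t
sketch_parent_blkid(uint64_t blkid, int indblkshift)
{
	int epbs = indblkshift - SKETCH_BLKPTRSHIFT;

	assert(epbs > 0 && epbs < 64);
	return (blkid >> epbs);
}

/* Index of this block's pointer within the parent's pointer array. */
static uint64_t
sketch_index_in_parent(uint64_t blkid, int indblkshift)
{
	int epbs = indblkshift - SKETCH_BLKPTRSHIFT;

	assert(epbs > 0 && epbs < 64);
	return (blkid & ((UINT64_C(1) << epbs) - 1));
}
```

For example, with an indirect block shift of 17 each parent holds 1024 pointers, so block 5000 lives under parent 4 at index 904.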
|
|
|
|
|
2020-02-08 01:22:29 +03:00
|
|
|
static void
|
|
|
|
dbuf_sync_bonus(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
|
|
|
void *data = dr->dt.dl.dr_data;
|
|
|
|
|
|
|
|
ASSERT0(db->db_level);
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
|
|
|
ASSERT(db->db_blkid == DMU_BONUS_BLKID);
|
|
|
|
ASSERT(data != NULL);
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbufs and arc bufs, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2020-02-08 01:22:29 +03:00
|
|
|
ASSERT3U(DN_MAX_BONUS_LEN(dn->dn_phys), <=,
|
|
|
|
DN_SLOTS_TO_BONUSLEN(dn->dn_phys->dn_extra_slots + 1));
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(DN_BONUS(dn->dn_phys), data, DN_MAX_BONUS_LEN(dn->dn_phys));
|
2020-02-08 01:22:29 +03:00
|
|
|
|
|
|
|
dbuf_sync_leaf_verify_bonus_dnode(dr);
|
|
|
|
|
|
|
|
dbuf_undirty_bonus(dr);
|
|
|
|
dbuf_rele_and_unlock(db, (void *)(uintptr_t)tx->tx_txg, B_FALSE);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
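The key indirection described in the commit message above can be illustrated with a small sketch. This is not the ZFS keystore: `toy_xor`, `toy_dataset_t`, and the other `toy_*` names are hypothetical, and XOR stands in for a real cipher purely to keep the example short. It only shows why changing the user's key requires rewrapping the per-dataset key, not re-encrypting the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	KEY_LEN	8

/* Toy stand-in for a cipher: XOR with a repeating key. NOT real crypto. */
static void
toy_xor(uint8_t *dst, const uint8_t *src, size_t len,
    const uint8_t key[KEY_LEN])
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src[i] ^ key[i % KEY_LEN];
}

/* Hypothetical dataset: data encrypted with a master key stored wrapped. */
typedef struct toy_dataset {
	uint8_t wrapped_master[KEY_LEN];
	uint8_t ciphertext[32];
	size_t len;
} toy_dataset_t;

static void
toy_create(toy_dataset_t *ds, const uint8_t user[KEY_LEN],
    const uint8_t master[KEY_LEN], const char *plain)
{
	ds->len = strlen(plain);
	assert(ds->len <= sizeof (ds->ciphertext));
	toy_xor(ds->ciphertext, (const uint8_t *)plain, ds->len, master);
	toy_xor(ds->wrapped_master, master, KEY_LEN, user);
}

/* Change the user key: unwrap with the old key, rewrap with the new one. */
static void
toy_change_key(toy_dataset_t *ds, const uint8_t old_user[KEY_LEN],
    const uint8_t new_user[KEY_LEN])
{
	uint8_t master[KEY_LEN];

	toy_xor(master, ds->wrapped_master, KEY_LEN, old_user);
	toy_xor(ds->wrapped_master, master, KEY_LEN, new_user);
	/* ds->ciphertext is deliberately left untouched. */
}

/* Read the data back; out must hold at least ds->len + 1 bytes. */
static void
toy_read(const toy_dataset_t *ds, const uint8_t user[KEY_LEN], char *out)
{
	uint8_t master[KEY_LEN];

	toy_xor(master, ds->wrapped_master, KEY_LEN, user);
	toy_xor((uint8_t *)out, ds->ciphertext, ds->len, master);
	out[ds->len] = '\0';
}
```

After `toy_change_key()`, the stored ciphertext is byte-for-byte unchanged and the data is readable with the new user key only.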
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
2018-04-17 21:06:54 +03:00
|
|
|
* When syncing out a block of dnodes, adjust the block to deal with
|
|
|
|
* encryption. Normally, we make sure the block is decrypted before writing
|
|
|
|
* it. If we have crypt params, then we are writing a raw (encrypted) block,
|
|
|
|
* from a raw receive. In this case, set the ARC buf's crypt params so
|
|
|
|
* that the BP will be filled with the correct byteorder, salt, iv, and mac.
|
2017-08-14 20:36:48 +03:00
|
|
|
*/
|
|
|
|
static void
|
2018-04-17 21:06:54 +03:00
|
|
|
dbuf_prepare_encrypted_dnode_leaf(dbuf_dirty_record_t *dr)
|
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&db->db_mtx));
|
2018-04-17 21:06:54 +03:00
|
|
|
ASSERT3U(db->db.db_object, ==, DMU_META_DNODE_OBJECT);
|
|
|
|
ASSERT3U(db->db_level, ==, 0);
|
2017-08-14 20:36:48 +03:00
|
|
|
|
2018-04-17 21:06:54 +03:00
|
|
|
if (!db->db_objset->os_raw_receive && arc_is_encrypted(db->db_buf)) {
|
2018-03-31 21:12:51 +03:00
|
|
|
zbookmark_phys_t zb;
|
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Unfortunately, there is currently no mechanism for
|
|
|
|
* syncing context to handle decryption errors. An error
|
|
|
|
* here is only possible if an attacker maliciously
|
|
|
|
* changed a dnode block and updated the associated
|
|
|
|
* checksums going up the block tree.
|
|
|
|
*/
|
2018-03-31 21:12:51 +03:00
|
|
|
SET_BOOKMARK(&zb, dmu_objset_id(db->db_objset),
|
|
|
|
db->db.db_object, db->db_level, db->db_blkid);
|
2017-08-14 20:36:48 +03:00
|
|
|
err = arc_untransform(db->db_buf, db->db_objset->os_spa,
|
2018-03-31 21:12:51 +03:00
|
|
|
&zb, B_TRUE);
|
2017-08-14 20:36:48 +03:00
|
|
|
if (err)
|
|
|
|
panic("Invalid dnode block MAC");
|
2018-04-17 21:06:54 +03:00
|
|
|
} else if (dr->dt.dl.dr_has_raw_params) {
|
|
|
|
(void) arc_release(dr->dt.dl.dr_data, db);
|
|
|
|
arc_convert_to_raw(dr->dt.dl.dr_data,
|
|
|
|
dmu_objset_id(db->db_objset),
|
|
|
|
dr->dt.dl.dr_byteorder, DMU_OT_DNODE,
|
|
|
|
dr->dt.dl.dr_salt, dr->dt.dl.dr_iv, dr->dt.dl.dr_mac);
|
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
}
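The branch structure of `dbuf_prepare_encrypted_dnode_leaf()` above can be summarized as a small decision function. This is a sketch only: `leaf_action_t` and `choose_leaf_action()` are hypothetical names, and the three booleans stand in for `os_raw_receive`, `arc_is_encrypted(db->db_buf)`, and `dr_has_raw_params`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three outcomes of the function above. */
typedef enum leaf_action {
	LEAF_NO_CHANGE,		/* neither branch applies */
	LEAF_DECRYPT,		/* arc_untransform() path */
	LEAF_CONVERT_TO_RAW	/* arc_release() + arc_convert_to_raw() path */
} leaf_action_t;

static leaf_action_t
choose_leaf_action(bool raw_receive, bool buf_encrypted, bool has_raw_params)
{
	/* Normal case: make sure the block is decrypted before writing it. */
	if (!raw_receive && buf_encrypted)
		return (LEAF_DECRYPT);

	/* Raw receive supplied crypt params: write the block as raw. */
	if (has_raw_params)
		return (LEAF_CONVERT_TO_RAW);

	return (LEAF_NO_CHANGE);
}
```

Because the checks form an `if` / `else if` chain, the decrypt branch takes precedence whenever both conditions hold.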
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* dbuf_sync_indirect() is called recursively from dbuf_sync_list() so it
|
2010-08-26 21:58:36 +04:00
|
|
|
* is critical that we not allow the compiler to inline this function into
|
|
|
|
* dbuf_sync_list() thereby drastically bloating the stack usage.
|
|
|
|
*/
|
|
|
|
noinline static void
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_sync_indirect(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
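The per-block CPU figure quoted in the commit message above can be re-derived from its Baseline and New lines. The sketch below assumes the block size is the same in both runs, so that CPU per MB/s is proportional to CPU per block; the function names are hypothetical.

```c
#include <assert.h>

/* CPU percentage consumed per MB/s of receive throughput. */
static double
cpu_per_mbps(double cpu_percent, double mb_per_sec)
{
	assert(mb_per_sec > 0.0);
	return (cpu_percent / mb_per_sec);
}

/* New per-block CPU cost as a fraction of the baseline cost. */
static double
per_block_cpu_ratio(void)
{
	/* Baseline: 450MB/s, receive_writer() 40% + dbuf_evict() 35%. */
	double baseline = cpu_per_mbps(40.0 + 35.0, 450.0);
	/* New: 670MB/s, receive_writer() 17% + dbuf_evict() 0%. */
	double updated = cpu_per_mbps(17.0 + 0.0, 670.0);

	return (updated / baseline);
}

/* Relative throughput improvement, as a fraction of the baseline. */
static double
throughput_gain(void)
{
	return (670.0 / 450.0 - 1.0);
}
```

With these inputs the ratio comes out at roughly 0.15, in line with the quoted "1/7th (14%)", and the throughput gain at roughly 0.49, in line with the quoted 50%.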
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
|
|
|
|
|
|
|
dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
|
|
|
|
ASSERT(db->db_level > 0);
|
|
|
|
DBUF_VERIFY(db);
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Read the block if it hasn't been read yet. */
|
2008-11-20 23:01:55 +03:00
|
|
|
if (db->db_buf == NULL) {
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED);
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
}
|
|
|
|
ASSERT3U(db->db_state, ==, DB_CACHED);
|
|
|
|
ASSERT(db->db_buf != NULL);
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Indirect block size must match what the dnode thinks it is. */
|
2010-08-27 01:24:34 +04:00
|
|
|
ASSERT3U(db->db.db_size, ==, 1<<dn->dn_phys->dn_indblkshift);
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_check_blkptr(dn, db);
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Provide the pending dirty record to child dbufs */
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_data_pending = dr;
|
|
|
|
|
|
|
|
mutex_exit(&db->db_mtx);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
dbuf_write(dr, db->db_buf, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-12-11 21:26:02 +03:00
|
|
|
zio_t *zio = dr->dr_zio;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&dr->dt.di.dr_mtx);
|
2015-07-02 19:23:20 +03:00
|
|
|
dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(list_head(&dr->dt.di.dr_children) == NULL);
|
|
|
|
mutex_exit(&dr->dt.di.dr_mtx);
|
|
|
|
zio_nowait(zio);
|
|
|
|
}
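The recursion between `dbuf_sync_list()` and `dbuf_sync_indirect()`, and the reason for `noinline`, can be sketched on a toy tree. The `node_t` type and both functions below are hypothetical, and `__attribute__((noinline))` assumes a GCC- or Clang-compatible compiler. Keeping the indirect function out of line means its locals are not merged into the list walker's stack frame at every level of recursion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node: level 0 is a leaf, level > 0 is indirect. */
typedef struct node {
	const struct node *children;	/* first child, NULL for leaves */
	const struct node *next;	/* next sibling in the same list */
	int level;
} node_t;

static int sync_list(const node_t *head);

/* Kept out of line so its frame is not folded into sync_list(). */
__attribute__((noinline)) static int
sync_indirect(const node_t *n)
{
	assert(n->level > 0);
	/* Count this node, then descend into its children. */
	return (1 + sync_list(n->children));
}

/* Walk one sibling list; returns the number of nodes visited. */
static int
sync_list(const node_t *head)
{
	int visited = 0;

	for (const node_t *n = head; n != NULL; n = n->next) {
		if (n->level > 0)
			visited += sync_indirect(n);
		else
			visited++;	/* leaf: nothing below it */
	}
	return (visited);
}
```

For one indirect node with two leaf children, `sync_list()` visits three nodes in total.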
|
|
|
|
|
2019-08-15 17:44:57 +03:00
|
|
|
/*
|
|
|
|
* Verify that the size of the data in our bonus buffer does not exceed
|
|
|
|
* its recorded size.
|
|
|
|
*
|
|
|
|
* The purpose of this verification is to catch any cases in development
|
|
|
|
* where the size of a phys structure (e.g. space_map_phys_t) grows and,
|
|
|
|
* due to incorrect feature management, older pools expect to read more
|
|
|
|
* data even though they didn't actually write it to begin with.
|
|
|
|
*
|
|
|
|
* For example, this would catch an error in the feature logic where we
|
|
|
|
* open an older pool and we expect to write the space map histogram of
|
|
|
|
* a space map with size SPACE_MAP_SIZE_V0.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_sync_leaf_verify_bonus_dnode(dbuf_dirty_record_t *dr)
|
|
|
|
{
|
2020-02-08 01:22:29 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2019-08-15 17:44:57 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Encrypted bonus buffers can have data past their bonuslen.
|
|
|
|
* Skip the verification of these blocks.
|
|
|
|
*/
|
|
|
|
if (DMU_OT_IS_ENCRYPTED(dn->dn_bonustype))
|
|
|
|
return;
|
|
|
|
|
|
|
|
uint16_t bonuslen = dn->dn_phys->dn_bonuslen;
|
|
|
|
uint16_t maxbonuslen = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots);
|
|
|
|
ASSERT3U(bonuslen, <=, maxbonuslen);
|
|
|
|
|
|
|
|
arc_buf_t *datap = dr->dt.dl.dr_data;
|
|
|
|
char *datap_end = ((char *)datap) + bonuslen;
|
|
|
|
char *datap_max = ((char *)datap) + maxbonuslen;
|
|
|
|
|
|
|
|
/* ensure that everything is zero after our data */
|
|
|
|
for (; datap_end < datap_max; datap_end++)
|
|
|
|
ASSERT(*datap_end == 0);
|
|
|
|
#endif
|
2020-02-08 01:22:29 +03:00
|
|
|
}
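The debug check that ends here walks the bytes between `bonuslen` and the maximum bonus length and asserts that each one is zero. A minimal standalone sketch of the same idea, assuming a plain byte buffer rather than a real dnode; the helper name `tail_is_zero` is hypothetical and not part of ZFS:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when every byte in
 * buf[used .. max) is zero, mirroring the loop that checks the
 * unused tail of a bonus buffer.
 */
static bool
tail_is_zero(const uint8_t *buf, size_t used, size_t max)
{
	assert(used <= max);
	for (size_t i = used; i < max; i++) {
		if (buf[i] != 0)
			return (false);
	}
	return (true);
}
```

Unlike the kernel code, which asserts on each byte, this sketch returns a boolean so that a caller can decide how to react.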
|
2019-08-15 17:44:57 +03:00
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbufs and arc bufs, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
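The per-block CPU figure in the commit message can be cross-checked from its two benchmark lines. Assuming the average block size is the same in both runs, CPU cost per block is proportional to CPU percentage divided by throughput. A small sketch of that arithmetic follows; the function names are hypothetical, and the inputs are the numbers quoted in the message:

```c
#include <assert.h>

/* CPU percentage consumed per MB/s of throughput. */
static double
cpu_per_mb(double cpu_percent, double mb_per_sec)
{
	assert(mb_per_sec > 0.0);
	return (cpu_percent / mb_per_sec);
}

/* Ratio of new to baseline per-MB CPU cost, using the quoted figures. */
static double
lightweight_cpu_ratio(void)
{
	/* receive_writer() + dbuf_evict() CPU, divided by throughput */
	double baseline = cpu_per_mb(40.0 + 35.0, 450.0);
	double updated = cpu_per_mb(17.0 + 0.0, 670.0);

	return (updated / baseline);
}
```

With these inputs the ratio comes out at roughly 0.15, which is in line with the "1/7th" claim, and 670 MB/s over 450 MB/s is an increase of about 49%.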
|
|
|
static blkptr_t *
|
|
|
|
dbuf_lightweight_bp(dbuf_dirty_record_t *dr)
|
|
|
|
{
|
|
|
|
/* This must be a lightweight dirty record. */
|
|
|
|
ASSERT3P(dr->dr_dbuf, ==, NULL);
|
|
|
|
dnode_t *dn = dr->dr_dnode;
|
|
|
|
|
|
|
|
if (dn->dn_phys->dn_nlevels == 1) {
|
|
|
|
VERIFY3U(dr->dt.dll.dr_blkid, <, dn->dn_phys->dn_nblkptr);
|
|
|
|
return (&dn->dn_phys->dn_blkptr[dr->dt.dll.dr_blkid]);
|
|
|
|
} else {
|
|
|
|
dmu_buf_impl_t *parent_db = dr->dr_parent->dr_dbuf;
|
|
|
|
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
|
|
|
|
VERIFY3U(parent_db->db_level, ==, 1);
|
|
|
|
VERIFY3P(parent_db->db_dnode_handle->dnh_dnode, ==, dn);
|
|
|
|
VERIFY3U(dr->dt.dll.dr_blkid >> epbs, ==, parent_db->db_blkid);
|
|
|
|
blkptr_t *bp = parent_db->db.db_data;
|
|
|
|
return (&bp[dr->dt.dll.dr_blkid & ((1 << epbs) - 1)]);
|
|
|
|
}
|
|
|
|
}
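The multi-level branch of `dbuf_lightweight_bp()` relies on two bit operations: shifting the level-0 block id right by `epbs` gives the block id of the covering level-1 indirect block, and masking with `(1 << epbs) - 1` gives the slot inside that block's array of block pointers. A standalone sketch, assuming a block-pointer shift of 7 (128-byte block pointers); the `EX_` macro and function names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

#define	EX_BLKPTRSHIFT	7	/* assumes a 128-byte block pointer */

/* Block id of the level-1 indirect block that covers a level-0 blkid. */
static uint64_t
parent_blkid(uint64_t blkid, int indblkshift)
{
	int epbs = indblkshift - EX_BLKPTRSHIFT;

	assert(epbs > 0);
	return (blkid >> epbs);
}

/* Slot of the level-0 blkid inside that indirect block's blkptr array. */
static uint64_t
slot_in_parent(uint64_t blkid, int indblkshift)
{
	int epbs = indblkshift - EX_BLKPTRSHIFT;

	assert(epbs > 0);
	return (blkid & ((1ULL << epbs) - 1));
}
```

For example, with a 128 KiB indirect block (`indblkshift` of 17) each indirect block holds 1024 pointers, so block id 5000 lives in indirect block 4 at slot 904.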
|
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_lightweight_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
dbuf_dirty_record_t *dr = zio->io_private;
|
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
|
|
|
|
if (zio->io_error != 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
dnode_t *dn = dr->dr_dnode;
|
|
|
|
|
|
|
|
blkptr_t *bp_orig = dbuf_lightweight_bp(dr);
|
|
|
|
spa_t *spa = dmu_objset_spa(dn->dn_objset);
|
|
|
|
int64_t delta = bp_get_dsize_sync(spa, bp) -
|
|
|
|
bp_get_dsize_sync(spa, bp_orig);
|
|
|
|
dnode_diduse_space(dn, delta);
|
|
|
|
|
|
|
|
uint64_t blkid = dr->dt.dll.dr_blkid;
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
if (blkid > dn->dn_phys->dn_maxblkid) {
|
|
|
|
ASSERT0(dn->dn_objset->os_raw_receive);
|
|
|
|
dn->dn_phys->dn_maxblkid = blkid;
|
|
|
|
}
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
|
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
uint64_t fill = BP_IS_HOLE(bp) ? 0 : 1;
|
|
|
|
BP_SET_FILL(bp, fill);
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_buf_impl_t *parent_db;
|
|
|
|
EQUIV(dr->dr_parent == NULL, dn->dn_phys->dn_nlevels == 1);
|
|
|
|
if (dr->dr_parent == NULL) {
|
|
|
|
parent_db = dn->dn_dbuf;
|
|
|
|
} else {
|
|
|
|
parent_db = dr->dr_parent->dr_dbuf;
|
|
|
|
}
|
|
|
|
rw_enter(&parent_db->db_rwlock, RW_WRITER);
|
|
|
|
*bp_orig = *bp;
|
|
|
|
rw_exit(&parent_db->db_rwlock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dbuf_lightweight_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
dbuf_dirty_record_t *dr = zio->io_private;
|
|
|
|
|
|
|
|
VERIFY0(zio->io_error);
|
|
|
|
|
|
|
|
objset_t *os = dr->dr_dnode->dn_objset;
|
|
|
|
dmu_tx_t *tx = os->os_synctx;
|
|
|
|
|
|
|
|
if (zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)) {
|
|
|
|
ASSERT(BP_EQUAL(zio->io_bp, &zio->io_bp_orig));
|
|
|
|
} else {
|
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
|
|
|
(void) dsl_dataset_block_kill(ds, &zio->io_bp_orig, tx, B_TRUE);
|
|
|
|
dsl_dataset_block_born(ds, zio->io_bp, tx);
|
|
|
|
}
|
|
|
|
|
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for the completion of individual physical writes within a
logical one, improving cases when some of the physical writes complete
much sooner than others, gradually opening the write throttle.
A few years after that, ZFS got allocation throttling, working at the
level of logical writes and limiting the number of writes queued to
vdevs at any point, and so limiting the latency distribution between
the physical writes and especially writes of multiple copies.
The addition of the scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Grown memory sizes over
the past 10 years should also reduce the importance of the smoothing.
While the use of the physdone callback may still in theory provide
somewhat smoother throttling, there are cases where we simply cannot
afford it. Since dirty data accounting is protected by a pool-wide
lock, in the case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
with each of the disks receiving 8-16KB chunks but the total rate
reaching 100K+ blocks per second. At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
|
|
|
dsl_pool_undirty_space(dmu_objset_pool(os), dr->dr_accounted,
|
|
|
|
zio->io_txg);
|
2020-12-11 21:26:02 +03:00
|
|
|
|
|
|
|
abd_free(dr->dt.dll.dr_abd);
|
|
|
|
kmem_free(dr, sizeof (*dr));
|
|
|
|
}
|
|
|
|
|
|
|
|
noinline static void
|
|
|
|
dbuf_sync_lightweight(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dnode_t *dn = dr->dr_dnode;
|
|
|
|
zio_t *pio;
|
|
|
|
if (dn->dn_phys->dn_nlevels == 1) {
|
|
|
|
pio = dn->dn_zio;
|
|
|
|
} else {
|
|
|
|
pio = dr->dr_parent->dr_zio;
|
|
|
|
}
|
|
|
|
|
|
|
|
zbookmark_phys_t zb = {
|
|
|
|
.zb_objset = dmu_objset_id(dn->dn_objset),
|
|
|
|
.zb_object = dn->dn_object,
|
|
|
|
.zb_level = 0,
|
|
|
|
.zb_blkid = dr->dt.dll.dr_blkid,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See comment in dbuf_write(). This is so that zio->io_bp_orig
|
|
|
|
* will have the old BP in dbuf_lightweight_done().
|
|
|
|
*/
|
|
|
|
dr->dr_bp_copy = *dbuf_lightweight_bp(dr);
|
|
|
|
|
|
|
|
dr->dr_zio = zio_write(pio, dmu_objset_spa(dn->dn_objset),
|
|
|
|
dmu_tx_get_txg(tx), &dr->dr_bp_copy, dr->dt.dll.dr_abd,
|
|
|
|
dn->dn_datablksz, abd_get_size(dr->dt.dll.dr_abd),
|
|
|
|
&dr->dt.dll.dr_props, dbuf_lightweight_ready, NULL,
|
2023-06-15 20:49:03 +03:00
|
|
|
dbuf_lightweight_done, dr, ZIO_PRIORITY_ASYNC_WRITE,
|
2020-12-11 21:26:02 +03:00
|
|
|
ZIO_FLAG_MUSTSUCCEED | dr->dt.dll.dr_flags, &zb);
|
|
|
|
|
|
|
|
zio_nowait(dr->dr_zio);
|
|
|
|
}
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* dbuf_sync_leaf() is called recursively from dbuf_sync_list() so it is
|
2010-08-26 21:58:36 +04:00
|
|
|
* critical that we not allow the compiler to inline this function into
|
|
|
|
* dbuf_sync_list() thereby drastically bloating the stack usage.
|
|
|
|
*/
|
|
|
|
noinline static void
|
2008-11-20 23:01:55 +03:00
|
|
|
dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
arc_buf_t **datap = &dr->dt.dl.dr_data;
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2010-08-27 01:24:34 +04:00
|
|
|
objset_t *os;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t txg = tx->tx_txg;
|
|
|
|
|
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
|
|
|
|
|
|
|
dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);
|
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
/*
|
|
|
|
* To be synced, we must be dirtied. But we
|
|
|
|
* might have been freed after the dirty.
|
|
|
|
*/
|
|
|
|
if (db->db_state == DB_UNCACHED) {
|
|
|
|
/* This buffer has been freed since it was dirtied */
|
|
|
|
ASSERT(db->db.db_data == NULL);
|
|
|
|
} else if (db->db_state == DB_FILL) {
|
|
|
|
/* This buffer was freed and is now being re-filled */
|
|
|
|
ASSERT(db->db.db_data != dr->dt.dl.dr_data);
|
2023-07-24 11:02:21 +03:00
|
|
|
} else if (db->db_state == DB_READ) {
|
|
|
|
/*
|
|
|
|
* This buffer has a clone we need to write, and an in-flight
|
|
|
|
* read on the BP we're about to clone. It's safe to issue the
|
|
|
|
* write here because the read has already been issued and the
|
|
|
|
* contents won't change.
|
|
|
|
*/
|
|
|
|
ASSERT(dr->dt.dl.dr_brtwrite &&
|
|
|
|
dr->dt.dl.dr_override_state == DR_OVERRIDDEN);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
DBUF_VERIFY(db);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_SPILL_BLKID) {
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
2016-06-08 10:22:07 +03:00
|
|
|
if (!(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR)) {
|
|
|
|
/*
|
|
|
|
* In the previous transaction group, the bonus buffer
|
|
|
|
* was entirely used to store the attributes for the
|
|
|
|
* dnode which overrode the dn_spill field. However,
|
|
|
|
* when adding more attributes to the file a spill
|
|
|
|
* block was required to hold the extra attributes.
|
|
|
|
*
|
|
|
|
* Make sure to clear the garbage left in the dn_spill
|
|
|
|
* field from the previous attributes in the bonus
|
|
|
|
* buffer. Otherwise, after writing out the spill
|
|
|
|
* block to the new allocated dva, it will free
|
|
|
|
* the old block pointed to by the invalid dn_spill.
|
|
|
|
*/
|
|
|
|
db->db_blkptr = NULL;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dn->dn_phys->dn_flags |= DNODE_FLAG_SPILL_BLKPTR;
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If this is a bonus buffer, simply copy the bonus data into the
|
|
|
|
* dnode. It will be written out when the dnode is synced (and it
|
|
|
|
* will be synced, since it must have been dirty for dbuf_sync to
|
|
|
|
* be called).
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_blkid == DMU_BONUS_BLKID) {
|
|
|
|
ASSERT(dr->dr_dbuf == db);
|
2020-02-08 01:22:29 +03:00
|
|
|
dbuf_sync_bonus(dr, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
os = dn->dn_objset;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This function may have dropped the db_mtx lock allowing a dmu_sync
|
|
|
|
* operation to sneak in. As a result, we need to ensure that we
|
|
|
|
* don't check the dr_override_state until we have returned from
|
|
|
|
* dbuf_check_blkptr.
|
|
|
|
*/
|
|
|
|
dbuf_check_blkptr(dn, db);
|
|
|
|
|
|
|
|
/*
|
2010-08-27 01:24:34 +04:00
|
|
|
* If this buffer is in the middle of an immediate write,
|
2008-11-20 23:01:55 +03:00
|
|
|
* wait for the synchronous IO to complete.
|
|
|
|
*/
|
|
|
|
while (dr->dt.dl.dr_override_state == DR_IN_DMU_SYNC) {
|
|
|
|
ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT);
|
|
|
|
cv_wait(&db->db_changed, &db->db_mtx);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* If this is a dnode block, ensure it is appropriately encrypted
|
|
|
|
* or decrypted, depending on what we are writing to it this txg.
|
|
|
|
*/
|
|
|
|
if (os->os_encrypted && dn->dn_object == DMU_META_DNODE_OBJECT)
|
2018-04-17 21:06:54 +03:00
|
|
|
dbuf_prepare_encrypted_dnode_leaf(dr);
|
2017-08-14 20:36:48 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (db->db_state != DB_NOFILL &&
|
|
|
|
dn->dn_object != DMU_META_DNODE_OBJECT &&
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_count(&db->db_holds) > 1 &&
|
2010-05-29 00:45:14 +04:00
|
|
|
dr->dt.dl.dr_override_state != DR_OVERRIDDEN &&
|
2009-07-03 02:44:48 +04:00
|
|
|
*datap == db->db_buf) {
|
|
|
|
/*
|
|
|
|
* If this buffer is currently "in use" (i.e., there
|
|
|
|
* are active holds and db_data still references it),
|
|
|
|
* then make a copy before we start the write so that
|
|
|
|
* any modifications from the open txg will not leak
|
|
|
|
* into this write.
|
|
|
|
*
|
|
|
|
* NOTE: this copy does not need to be made for
|
|
|
|
* objects only modified in the syncing context (e.g.
|
|
|
|
* DMU_META_DNODE blocks).
|
|
|
|
*/
|
2021-06-23 07:39:15 +03:00
|
|
|
int psize = arc_buf_size(*datap);
|
|
|
|
int lsize = arc_buf_lsize(*datap);
|
|
|
|
arc_buf_contents_t type = DBUF_GET_BUFC_TYPE(db);
|
|
|
|
enum zio_compress compress_type = arc_get_compression(*datap);
|
|
|
|
uint8_t complevel = arc_get_complevel(*datap);
|
|
|
|
|
|
|
|
if (arc_is_encrypted(*datap)) {
|
|
|
|
boolean_t byteorder;
|
|
|
|
uint8_t salt[ZIO_DATA_SALT_LEN];
|
|
|
|
uint8_t iv[ZIO_DATA_IV_LEN];
|
|
|
|
uint8_t mac[ZIO_DATA_MAC_LEN];
|
|
|
|
|
|
|
|
arc_get_raw_params(*datap, &byteorder, salt, iv, mac);
|
|
|
|
*datap = arc_alloc_raw_buf(os->os_spa, db,
|
|
|
|
dmu_objset_id(os), byteorder, salt, iv, mac,
|
|
|
|
dn->dn_type, psize, lsize, compress_type,
|
|
|
|
complevel);
|
|
|
|
} else if (compress_type != ZIO_COMPRESS_OFF) {
|
|
|
|
ASSERT3U(type, ==, ARC_BUFC_DATA);
|
|
|
|
*datap = arc_alloc_compressed_buf(os->os_spa, db,
|
|
|
|
psize, lsize, compress_type, complevel);
|
|
|
|
} else {
|
|
|
|
*datap = arc_alloc_buf(os->os_spa, db, type, psize);
|
|
|
|
}
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy((*datap)->b_data, db->db.db_data, psize);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
db->db_data_pending = dr;
|
|
|
|
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
dbuf_write(dr, *datap, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(!list_link_active(&dr->dr_dirty_node));
|
2010-08-27 01:24:34 +04:00
|
|
|
if (dn->dn_object == DMU_META_DNODE_OBJECT) {
|
2019-04-12 21:30:59 +03:00
|
|
|
list_insert_tail(&dn->dn_dirty_records[txg & TXG_MASK], dr);
|
2010-08-27 01:24:34 +04:00
|
|
|
} else {
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_nowait(dr->dr_zio);
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
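The meta-dnode branch at the end of `dbuf_sync_leaf()` re-queues the record on `dn_dirty_records[txg & TXG_MASK]`. The mask picks one slot of a small per-txg ring, so consecutive transaction groups use different lists and a slot is reused only after the ring wraps. A minimal sketch, assuming a ring of four slots; the `EX_` names are illustrative and not the kernel definitions:

```c
#include <assert.h>
#include <stdint.h>

#define	EX_TXG_SIZE	4	/* assumed ring size, must be a power of two */
#define	EX_TXG_MASK	(EX_TXG_SIZE - 1)

/* Ring slot used for a given transaction group number. */
static int
txg_slot(uint64_t txg)
{
	return ((int)(txg & EX_TXG_MASK));
}
```

Because the size is a power of two, the mask is equivalent to `txg % EX_TXG_SIZE` but avoids a division.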
|
|
|
|
|
|
|
|
void
|
2015-07-02 19:23:20 +03:00
|
|
|
dbuf_sync_list(list_t *list, int level, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dbuf_dirty_record_t *dr;
|
|
|
|
|
2010-08-26 20:52:42 +04:00
|
|
|
while ((dr = list_head(list))) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dr->dr_zio != NULL) {
|
|
|
|
/*
|
|
|
|
* If we find an already initialized zio then we
|
|
|
|
* are processing the meta-dnode, and we have finished.
|
|
|
|
* The dbufs for all dnodes are put back on the list
|
|
|
|
* during processing, so that we can zio_wait()
|
|
|
|
* these IOs after initiating all child IOs.
|
|
|
|
*/
|
|
|
|
ASSERT3U(dr->dr_dbuf->db.db_object, ==,
|
|
|
|
DMU_META_DNODE_OBJECT);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
list_remove(list, dr);
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
		if (dr->dr_dbuf == NULL) {
			dbuf_sync_lightweight(dr, tx);
		} else {
			if (dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID &&
			    dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID) {
				VERIFY3U(dr->dr_dbuf->db_level, ==, level);
			}
			if (dr->dr_dbuf->db_level > 0)
				dbuf_sync_indirect(dr, tx);
			else
				dbuf_sync_leaf(dr, tx);
		}
2008-11-20 23:01:55 +03:00
	}
}

static void
dbuf_write_ready(zio_t *zio, arc_buf_t *buf, void *vdb)
{
2021-12-12 18:06:44 +03:00
	(void) buf;
2008-11-20 23:01:55 +03:00
	dmu_buf_impl_t *db = vdb;
2010-08-27 01:24:34 +04:00
	dnode_t *dn;
2008-12-03 23:09:06 +03:00
	blkptr_t *bp = zio->io_bp;
2008-11-20 23:01:55 +03:00
	blkptr_t *bp_orig = &zio->io_bp_orig;
2010-05-29 00:45:14 +04:00
	spa_t *spa = zio->io_spa;
	int64_t delta;
2008-11-20 23:01:55 +03:00
	uint64_t fill = 0;
2010-05-29 00:45:14 +04:00
	int i;
2008-11-20 23:01:55 +03:00

2016-04-21 21:23:37 +03:00
	ASSERT3P(db->db_blkptr, !=, NULL);
	ASSERT3P(&db->db_data_pending->dr_bp_copy, ==, bp);
2008-12-03 23:09:06 +03:00

2010-08-27 01:24:34 +04:00
	DB_DNODE_ENTER(db);
	dn = DB_DNODE(db);
2010-05-29 00:45:14 +04:00
	delta = bp_get_dsize_sync(spa, bp) - bp_get_dsize_sync(spa, bp_orig);
	dnode_diduse_space(dn, delta - zio->io_prev_space_delta);
	zio->io_prev_space_delta = delta;
2008-11-20 23:01:55 +03:00

2013-12-09 22:37:51 +04:00
	if (bp->blk_birth != 0) {
		ASSERT((db->db_blkid != DMU_SPILL_BLKID &&
		    BP_GET_TYPE(bp) == dn->dn_type) ||
		    (db->db_blkid == DMU_SPILL_BLKID &&
2014-06-06 01:19:08 +04:00
		    BP_GET_TYPE(bp) == dn->dn_bonustype) ||
		    BP_IS_EMBEDDED(bp));
2013-12-09 22:37:51 +04:00
		ASSERT(BP_GET_LEVEL(bp) == db->db_level);
2008-11-20 23:01:55 +03:00
	}

	mutex_enter(&db->db_mtx);

2010-05-29 00:45:14 +04:00
#ifdef ZFS_DEBUG
	if (db->db_blkid == DMU_SPILL_BLKID) {
		ASSERT(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR);
2016-04-21 21:23:37 +03:00
		ASSERT(!(BP_IS_HOLE(bp)) &&
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
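The two slot-count conventions described above can be sketched as follows. This is a minimal illustration rather than the OpenZFS source: every `sketch_`-prefixed name and `SKETCH_DNODE_MIN_SIZE` are hypothetical, and only the 512-byte slot size, the 16k upper bound, and the "extra slots" versus "total slots" relationship are taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed slot size: one legacy 512-byte dnode (hypothetical name). */
#define	SKETCH_DNODE_MIN_SIZE	512

/* Total slots (in-memory convention) for a dnode of the given byte size. */
static inline uint8_t
sketch_size_to_num_slots(size_t dnode_size)
{
	/* Sizes must be non-zero multiples of 512 bytes, at most 16k. */
	assert(dnode_size != 0 && dnode_size % SKETCH_DNODE_MIN_SIZE == 0);
	assert(dnode_size <= 16384);
	return ((uint8_t)(dnode_size / SKETCH_DNODE_MIN_SIZE));
}

/* On-disk convention: slots beyond the first, so 512-byte dnodes store 0. */
static inline uint8_t
sketch_num_slots_to_extra(uint8_t num_slots)
{
	assert(num_slots >= 1);
	return ((uint8_t)(num_slots - 1));
}

/* Inverse mapping, as would be needed when loading an on-disk dnode. */
static inline uint8_t
sketch_extra_to_num_slots(uint8_t extra_slots)
{
	return ((uint8_t)(extra_slots + 1));
}
```

Storing the "extra" count on disk is what lets a zero-filled pad byte in an old 512-byte dnode read back as a valid value.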
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
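The slot-skipping iteration in the last bullet can be sketched as below. This is an assumption-laden illustration, not the real traversal code: `sketch_dnode_phys_t`, its two fields, and `sketch_count_dnodes()` are hypothetical stand-ins for the much larger on-disk structure and the functions named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_SLOT_SIZE	512	/* assumed size of one dnode slot */
#define	SKETCH_TYPE_NONE	0	/* assumed marker for a free slot */

/* Hypothetical, heavily reduced stand-in for an on-disk dnode slot. */
typedef struct sketch_dnode_phys {
	uint8_t	type;		/* SKETCH_TYPE_NONE if the slot is free */
	uint8_t	extra_slots;	/* slots consumed beyond the first */
	uint8_t	pad[SKETCH_SLOT_SIZE - 2];
} sketch_dnode_phys_t;

/*
 * Count allocated dnodes in a block of nslots slots. An allocated dnode
 * advances the cursor past its extra slots, so the interior of a large
 * dnode is never interpreted as a dnode header.
 */
static size_t
sketch_count_dnodes(const sketch_dnode_phys_t *slots, size_t nslots)
{
	size_t count = 0;
	size_t i = 0;

	assert(slots != NULL || nslots == 0);
	while (i < nslots) {
		const sketch_dnode_phys_t *dnp = &slots[i];

		i++;	/* every dnode occupies at least one slot */
		if (dnp->type != SKETCH_TYPE_NONE) {
			count++;
			i += dnp->extra_slots;	/* skip the interior slots */
		}
	}
	return (count);
}
```

With plain array indexing, the second slot of a two-slot dnode would be examined as if it were a header; advancing by the extra-slot count avoids that.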
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
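A rough sketch of packing a slot count into the top byte of a 64-bit object id field follows. The helper names and constants are hypothetical, and the sketch stores the count unmodified; whether the real ZIL macros bias or otherwise encode the value is not stated above and is not assumed here.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_OBJ_BITS		48	/* object id width stated above */
#define	SKETCH_SLOTS_SHIFT	56	/* uppermost 8 bits of a 64-bit word */
#define	SKETCH_OBJ_MASK		((UINT64_C(1) << SKETCH_OBJ_BITS) - 1)

/* Combine an object id (at most 48 bits) with an 8-bit slot count. */
static inline uint64_t
sketch_foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~SKETCH_OBJ_MASK) == 0);
	return (obj | ((uint64_t)slots << SKETCH_SLOTS_SHIFT));
}

/* Recover the object id by masking off everything above bit 47. */
static inline uint64_t
sketch_foid_obj(uint64_t foid)
{
	return (foid & SKETCH_OBJ_MASK);
}

/* Recover the slot count from the top byte. */
static inline uint8_t
sketch_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_SLOTS_SHIFT));
}
```

A record written with a slot count of 0 is bit-for-bit identical to a plain object id, which is consistent with the bits having been unused before.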
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
		    db->db_blkptr == DN_SPILL_BLKPTR(dn->dn_phys));
2010-05-29 00:45:14 +04:00
	}
#endif

2008-11-20 23:01:55 +03:00
	if (db->db_level == 0) {
		mutex_enter(&dn->dn_mtx);
2010-05-29 00:45:14 +04:00
		if (db->db_blkid > dn->dn_phys->dn_maxblkid &&
2018-06-28 19:20:34 +03:00
		    db->db_blkid != DMU_SPILL_BLKID) {
			ASSERT0(db->db_objset->os_raw_receive);
2008-11-20 23:01:55 +03:00
			dn->dn_phys->dn_maxblkid = db->db_blkid;
2018-06-28 19:20:34 +03:00
		}
2008-11-20 23:01:55 +03:00
		mutex_exit(&dn->dn_mtx);

		if (dn->dn_type == DMU_OT_DNODE) {
2016-03-17 04:25:34 +03:00
			i = 0;
			while (i < db->db.db_size) {
2017-06-29 20:18:03 +03:00
				dnode_phys_t *dnp =
				    (void *)(((char *)db->db.db_data) + i);
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
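The packing described above can be sketched roughly as follows. This is
only an illustration, not the actual ZIL code: the helper names
foid_pack(), foid_obj() and foid_slots() are hypothetical, and storing
the raw slot count (rather than some offset encoding of it) is an
assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: low 48 bits hold the object id, top 8 the slots. */
#define	FOID_OBJ_BITS	48
#define	FOID_OBJ_MASK	((UINT64_C(1) << FOID_OBJ_BITS) - 1)
#define	FOID_SLOT_SHIFT	56

/* Hypothetical helper: combine an object id and a slot count. */
static uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~FOID_OBJ_MASK) == 0);	/* id must fit in 48 bits */
	return (obj | ((uint64_t)slots << FOID_SLOT_SHIFT));
}

/* Hypothetical helper: recover the object id from the packed field. */
static uint64_t
foid_obj(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Hypothetical helper: recover the slot count from the packed field. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOT_SHIFT));
}
```

Because the object id never uses the upper bits, both values round-trip
through a single 64-bit field without changing the record layout.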
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
|
|
|
|
i += DNODE_MIN_SIZE;
|
|
|
|
if (dnp->dn_type != DMU_OT_NONE) {
|
2008-11-20 23:01:55 +03:00
|
|
|
fill++;
|
Verify block pointers before writing them out
If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.
However, such corruption is not typically recoverable. To recover from
it we would need to detect the memory error before the block pointer is
written to disk.
This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.
To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.
Additionally, when corruption is detected, the raw words of the block
pointer are logged. (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817
2023-05-08 21:20:23 +03:00
|
|
|
for (int j = 0; j < dnp->dn_nblkptr;
|
|
|
|
j++) {
|
|
|
|
(void) zfs_blkptr_verify(spa,
|
|
|
|
&dnp->dn_blkptr[j],
|
|
|
|
BLK_CONFIG_SKIP,
|
|
|
|
BLK_VERIFY_HALT);
|
|
|
|
}
|
|
|
|
if (dnp->dn_flags &
|
|
|
|
DNODE_FLAG_SPILL_BLKPTR) {
|
|
|
|
(void) zfs_blkptr_verify(spa,
|
|
|
|
DN_SPILL_BLKPTR(dnp),
|
|
|
|
BLK_CONFIG_SKIP,
|
|
|
|
BLK_VERIFY_HALT);
|
|
|
|
}
|
2016-03-17 04:25:34 +03:00
|
|
|
i += dnp->dn_extra_slots *
|
|
|
|
DNODE_MIN_SIZE;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
} else {
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_IS_HOLE(bp)) {
|
|
|
|
fill = 0;
|
|
|
|
} else {
|
|
|
|
fill = 1;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
} else {
|
2008-12-03 23:09:06 +03:00
|
|
|
blkptr_t *ibp = db->db.db_data;
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3U(db->db.db_size, ==, 1<<dn->dn_phys->dn_indblkshift);
|
2008-12-03 23:09:06 +03:00
|
|
|
for (i = db->db.db_size >> SPA_BLKPTRSHIFT; i > 0; i--, ibp++) {
|
|
|
|
if (BP_IS_HOLE(ibp))
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
2023-05-08 21:20:23 +03:00
|
|
|
(void) zfs_blkptr_verify(spa, ibp,
|
|
|
|
BLK_CONFIG_SKIP, BLK_VERIFY_HALT);
|
2014-06-06 01:19:08 +04:00
|
|
|
fill += BP_GET_FILL(ibp);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
DB_DNODE_EXIT(db);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp))
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
BP_SET_FILL(bp, fill);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_exit(&db->db_mtx);
|
2016-04-21 21:23:37 +03:00
|
|
|
|
2019-07-08 23:18:50 +03:00
|
|
|
db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_WRITER, FTAG);
|
2016-04-21 21:23:37 +03:00
|
|
|
*db->db_blkptr = *bp;
|
2019-07-08 23:18:50 +03:00
|
|
|
dmu_buf_unlock_parent(db, dblt, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-05-15 18:02:28 +03:00
|
|
|
/*
|
|
|
|
* This function gets called just prior to running through the compression
|
|
|
|
* stage of the zio pipeline. If we're an indirect block comprised of only
|
|
|
|
* holes, then we want this indirect to be compressed away to a hole. In
|
|
|
|
* order to do that we must zero out any information about the holes that
|
|
|
|
* this indirect points to before we try to compress it.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_write_children_ready(zio_t *zio, arc_buf_t *buf, void *vdb)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zio, (void) buf;
|
2016-05-15 18:02:28 +03:00
|
|
|
dmu_buf_impl_t *db = vdb;
|
|
|
|
dnode_t *dn;
|
|
|
|
blkptr_t *bp;
|
2017-01-28 23:11:09 +03:00
|
|
|
unsigned int epbs, i;
|
2016-05-15 18:02:28 +03:00
|
|
|
|
|
|
|
ASSERT3U(db->db_level, >, 0);
|
|
|
|
DB_DNODE_ENTER(db);
|
|
|
|
dn = DB_DNODE(db);
|
|
|
|
epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
|
2017-01-28 23:11:09 +03:00
|
|
|
ASSERT3U(epbs, <, 31);
|
2016-05-15 18:02:28 +03:00
|
|
|
|
|
|
|
/* Determine if all our children are holes */
|
2016-10-14 00:30:50 +03:00
|
|
|
for (i = 0, bp = db->db.db_data; i < 1ULL << epbs; i++, bp++) {
|
2016-05-15 18:02:28 +03:00
|
|
|
if (!BP_IS_HOLE(bp))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If all the children are holes, then zero them all out so that
|
|
|
|
* we may get compressed away.
|
|
|
|
*/
|
2016-10-14 00:30:50 +03:00
|
|
|
if (i == 1ULL << epbs) {
|
2017-01-28 23:11:09 +03:00
|
|
|
/*
|
|
|
|
* We only found holes. Grab the rwlock to prevent
|
|
|
|
* anybody from reading the blocks we're about to
|
|
|
|
* zero out.
|
|
|
|
*/
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_enter(&db->db_rwlock, RW_WRITER);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(db->db.db_data, 0, db->db.db_size);
|
2019-07-08 23:18:50 +03:00
|
|
|
rw_exit(&db->db_rwlock);
|
2016-05-15 18:02:28 +03:00
|
|
|
}
|
|
|
|
DB_DNODE_EXIT(db);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dbuf_write_done(zio_t *zio, arc_buf_t *buf, void *vdb)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) buf;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_buf_impl_t *db = vdb;
|
2010-05-29 00:45:14 +04:00
|
|
|
blkptr_t *bp_orig = &zio->io_bp_orig;
|
2013-12-09 22:37:51 +04:00
|
|
|
blkptr_t *bp = db->db_blkptr;
|
|
|
|
objset_t *os = db->db_objset;
|
|
|
|
dmu_tx_t *tx = os->os_synctx;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(zio->io_error);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_blkptr == bp);
|
|
|
|
|
2013-05-10 23:47:54 +04:00
|
|
|
/*
|
|
|
|
* For nopwrites and rewrites we ensure that the bp matches our
|
|
|
|
* original and bypass all the accounting.
|
|
|
|
*/
|
|
|
|
if (zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(BP_EQUAL(bp, bp_orig));
|
|
|
|
} else {
|
2013-12-09 22:37:51 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) dsl_dataset_block_kill(ds, bp_orig, tx, B_TRUE);
|
|
|
|
dsl_dataset_block_born(ds, bp, tx);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
DBUF_VERIFY(db);
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
dbuf_dirty_record_t *dr = db->db_data_pending;
|
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!list_link_active(&dr->dr_dirty_node));
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(dr->dr_dbuf == db);
|
2020-02-05 22:07:19 +03:00
|
|
|
ASSERT(list_next(&db->db_dirty_records, dr) == NULL);
|
|
|
|
list_remove(&db->db_dirty_records, dr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
if (db->db_blkid == DMU_SPILL_BLKID) {
|
|
|
|
ASSERT(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR);
|
|
|
|
ASSERT(!(BP_IS_HOLE(db->db_blkptr)) &&
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two, one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
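The relationship between the two conventions can be sketched as below.
This is a minimal illustration with hypothetical helper names and
SKETCH_-prefixed constants; only the 512 byte slot size and the
"extra slots" versus "total slots" conventions come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_DNODE_SHIFT	9	/* 512-byte slots */
#define	SKETCH_DNODE_MIN_SIZE	(1 << SKETCH_DNODE_SHIFT)

/* On-disk convention: number of slots beyond the first. */
static uint8_t
size_to_extra_slots(size_t dnsize)
{
	assert(dnsize >= SKETCH_DNODE_MIN_SIZE);
	assert(dnsize % SKETCH_DNODE_MIN_SIZE == 0);
	return ((uint8_t)((dnsize >> SKETCH_DNODE_SHIFT) - 1));
}

/* In-memory convention: total number of slots consumed. */
static uint8_t
extra_to_num_slots(uint8_t extra_slots)
{
	return ((uint8_t)(extra_slots + 1));
}

/* Size in bytes implied by a total slot count. */
static size_t
num_slots_to_size(uint8_t num_slots)
{
	return ((size_t)num_slots << SKETCH_DNODE_SHIFT);
}
```

A 512 byte dnode maps to 0 extra slots, which is why the on-disk field
reads correctly as zero-filled padding on older software.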
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
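The two size rules above can be sketched as a pair of checks. The
function names and SKETCH_ constants here are hypothetical and do not
correspond to real ZFS interfaces; the sketch covers only literal sizes,
not the "legacy" and "auto" property values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	SKETCH_MIN_DNSIZE	512
#define	SKETCH_MAX_DNSIZE	16384

/* Internal rule: any multiple of 512 from 512 up to 16k. */
static bool
dnsize_is_legal(size_t sz)
{
	return (sz >= SKETCH_MIN_DNSIZE && sz <= SKETCH_MAX_DNSIZE &&
	    sz % SKETCH_MIN_DNSIZE == 0);
}

/* Property rule: literal values are powers of two from 1k to 16k. */
static bool
dnsize_is_settable(size_t sz)
{
	return (sz >= 1024 && sz <= SKETCH_MAX_DNSIZE &&
	    (sz & (sz - 1)) == 0);
}

/* Slot count for a legal size; callers are expected to validate first. */
static unsigned
dnsize_to_slots(size_t sz)
{
	assert(dnsize_is_legal(sz));
	return ((unsigned)(sz / SKETCH_MIN_DNSIZE));
}
```

For example, 1536 passes the internal check but would be rejected as a
property value, while 2048 passes both.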
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
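The forwarding pattern used by the existing interfaces can be sketched
as follows. The sketch_ functions are hypothetical stand-ins with
simplified signatures, not the real DMU or ZAP prototypes; the only
behavior taken from the text is that a dnodesize of 0 selects the
legacy size.

```c
#include <assert.h>

#define	SKETCH_DNODE_MIN_SIZE	512

/*
 * Hypothetical new-style entry point. A dnodesize of 0 means "use the
 * legacy dnode size". The return value stands in for the size the new
 * object would be given.
 */
static int
sketch_object_alloc_dnsize(int dnodesize)
{
	if (dnodesize == 0)
		dnodesize = SKETCH_DNODE_MIN_SIZE;
	assert(dnodesize % SKETCH_DNODE_MIN_SIZE == 0);
	return (dnodesize);
}

/* Existing-style entry point keeps its signature and forwards with 0. */
static int
sketch_object_alloc(void)
{
	return (sketch_object_alloc_dnsize(0));
}
```

Keeping the old signature as a thin wrapper means existing callers
compile unchanged and continue to get 512 byte dnodes.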
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
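The slot-skipping iteration described in the last item can be sketched
like this. The sketch_slot_t structure is a heavily simplified,
hypothetical stand-in for a dnode_phys_t slot, and it tracks the extra
slot count directly instead of going through dn_num_slots; the real
traversal code differs in detail.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_SLOTS_PER_BLOCK	32	/* 16k block / 512-byte slots */
#define	SKETCH_OT_NONE		0	/* stand-in for DMU_OT_NONE */

/* Simplified stand-in for one 512-byte slot in a dnode block. */
typedef struct sketch_slot {
	uint8_t ss_type;	/* SKETCH_OT_NONE means the slot is free */
	uint8_t ss_extra_slots;	/* slots consumed beyond the first */
} sketch_slot_t;

/* Count allocated dnodes, stepping over the interior of large dnodes. */
static int
sketch_count_dnodes(const sketch_slot_t *blk)
{
	int count = 0;

	for (int i = 0; i < SKETCH_SLOTS_PER_BLOCK; i++) {
		if (blk[i].ss_type == SKETCH_OT_NONE)
			continue;
		/* A dnode must not span multiple dnode blocks. */
		assert(i + blk[i].ss_extra_slots < SKETCH_SLOTS_PER_BLOCK);
		count++;
		/* Skip the extra slots so they are not read as dnodes. */
		i += blk[i].ss_extra_slots;
	}
	return (count);
}

/*
 * Example block: a 4-slot dnode at slot 0 whose interior holds
 * non-zero garbage, followed by a 1-slot dnode at slot 4.
 */
static int
sketch_example(void)
{
	sketch_slot_t blk[SKETCH_SLOTS_PER_BLOCK] = {{0}};

	blk[0] = (sketch_slot_t){ .ss_type = 19, .ss_extra_slots = 3 };
	for (int i = 1; i <= 3; i++)
		blk[i] = (sketch_slot_t){ .ss_type = 0xff,
		    .ss_extra_slots = 0xff };
	blk[4] = (sketch_slot_t){ .ss_type = 19, .ss_extra_slots = 0 };
	return (sketch_count_dnodes(blk));
}
```

With simple array indexing the garbage in slots 1 through 3 would be
misread as three more dnodes; skipping by the slot count avoids that.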
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
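As a rough illustration of the encoding described above, the sketch
below packs a slot count into the top 8 bits of a 64-bit object id
field. The helper names (foid_pack, foid_slots, foid_obj) are
hypothetical and are not the macros used by the patch. Only the bit
positions come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define	FOID_OBJ_BITS	48	/* object ids are capped at 48 bits */
#define	FOID_SLOT_SHIFT	56	/* slot count lives in bits 56..63 */

/* Combine an object id and a slot count into one 64-bit value. */
static uint64_t
foid_pack(uint64_t obj, uint8_t slots)
{
	assert(obj < (1ULL << FOID_OBJ_BITS));
	return (((uint64_t)slots << FOID_SLOT_SHIFT) | obj);
}

/* Extract the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOT_SHIFT));
}

/* Extract the object id from the low 48 bits. */
static uint64_t
foid_obj(uint64_t foid)
{
	return (foid & ((1ULL << FOID_OBJ_BITS) - 1));
}
```

In this sketch, a value written without a slot count has zeros in the
upper bits and therefore decodes to a slot count of 0.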
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
		    db->db_blkptr == DN_SPILL_BLKPTR(dn->dn_phys));
2010-05-29 00:45:14 +04:00
	}
#endif

2008-11-20 23:01:55 +03:00
	if (db->db_level == 0) {
2010-05-29 00:45:14 +04:00
		ASSERT(db->db_blkid != DMU_BONUS_BLKID);
2008-11-20 23:01:55 +03:00
		ASSERT(dr->dt.dl.dr_override_state == DR_NOT_OVERRIDDEN);
2008-12-03 23:09:06 +03:00
		if (db->db_state != DB_NOFILL) {
2023-03-10 22:59:53 +03:00
			if (dr->dt.dl.dr_data != NULL &&
			    dr->dt.dl.dr_data != db->db_buf) {
2016-06-02 07:04:53 +03:00
				arc_buf_destroy(dr->dt.dl.dr_data, db);
2023-03-10 22:59:53 +03:00
			}
2008-12-03 23:09:06 +03:00
		}
2008-11-20 23:01:55 +03:00
	} else {
		ASSERT(list_head(&dr->dt.di.dr_children) == NULL);
2013-12-09 22:37:51 +04:00
		ASSERT3U(db->db.db_size, ==, 1 << dn->dn_phys->dn_indblkshift);
2008-11-20 23:01:55 +03:00
		if (!BP_IS_HOLE(db->db_blkptr)) {
2019-12-05 23:37:00 +03:00
			int epbs __maybe_unused = dn->dn_phys->dn_indblkshift -
			    SPA_BLKPTRSHIFT;
2013-12-09 22:37:51 +04:00
			ASSERT3U(db->db_blkid, <=,
			    dn->dn_phys->dn_maxblkid >> (db->db_level * epbs));
2008-11-20 23:01:55 +03:00
			ASSERT3U(BP_GET_LSIZE(db->db_blkptr), ==,
			    db->db.db_size);
		}
		mutex_destroy(&dr->dt.di.dr_mtx);
		list_destroy(&dr->dt.di.dr_children);
	}

	cv_broadcast(&db->db_changed);
	ASSERT(db->db_dirtycnt > 0);
	db->db_dirtycnt -= 1;
	db->db_data_pending = NULL;
2018-08-01 00:51:15 +03:00
	dbuf_rele_and_unlock(db, (void *)(uintptr_t)tx->tx_txg, B_FALSE);
dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess by
combining the information in the filed issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explanation
for the bug being hit is that the waiter that was just awakened
didn't go through `dsl_pool_dirty_delta()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into either of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call into one of the above functions but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g. both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more about why the above situation occurred.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
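The goal stated above (undirty exactly what was dirtied, with a final
cleanup at sync time) can be modeled with a small sketch. This is a
simplified, single-threaded model for illustration only: pool_acct_t and
the acct_* helpers are hypothetical names, locking and condition
variables are omitted, and the ring of four per-TXG counters is an
assumption.

```c
#include <assert.h>
#include <stdint.h>

#define	TXG_SIZE	4
#define	TXG_MASK	(TXG_SIZE - 1)

/* Hypothetical model of the pool's dirty data accounting. */
typedef struct pool_acct {
	uint64_t dirty_pertxg[TXG_SIZE];	/* dirty bytes per open TXG */
	uint64_t dirty_total;			/* sum over all TXGs */
} pool_acct_t;

/* Account for newly dirtied space in the given TXG. */
static void
acct_dirty(pool_acct_t *p, uint64_t space, uint64_t txg)
{
	p->dirty_pertxg[txg & TXG_MASK] += space;
	p->dirty_total += space;
}

/* Remove space from the accounting, clamped to what the TXG holds. */
static void
acct_undirty(pool_acct_t *p, uint64_t space, uint64_t txg)
{
	uint64_t *pertxg = &p->dirty_pertxg[txg & TXG_MASK];

	if (space > *pertxg)
		space = *pertxg;	/* tolerate rounding errors */
	*pertxg -= space;
	assert(p->dirty_total >= space);
	p->dirty_total -= space;
}

/* At the end of sync, shore up whatever is still accounted to the TXG. */
static void
acct_sync_done(pool_acct_t *p, uint64_t txg)
{
	acct_undirty(p, p->dirty_pertxg[txg & TXG_MASK], txg);
}
```

When every write undirties what it dirtied, acct_sync_done() has nothing
left to subtract. When a code path skips the undirty step, the whole
amount stays accounted until sync, which is the behavior observed with
`sync=always` and dedup.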
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less than we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first make sure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we have synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
whenever we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filed. That said, the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes #9137
2019-08-16 02:53:53 +03:00
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for the completion of individual physical writes within a
logical one, improving cases when some of the physical writes complete
much sooner than others, gradually opening the write throttle.
A few years after that, ZFS got allocation throttling, which works at
the level of logical writes and limits the number of writes queued to
vdevs at any point, thereby limiting the latency distribution between
the physical writes and especially writes of multiple copies.
The addition of the scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Larger memory sizes over
the past 10 years should also reduce the importance of the smoothing.
While the use of the physdone callback may still in theory provide
somewhat smoother throttling, there are cases where we simply cannot
afford it. Since dirty data accounting is protected by a pool-wide
lock, in the case of a 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaches 100K+ blocks per second. At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
	dsl_pool_undirty_space(dmu_objset_pool(os), dr->dr_accounted,
	    zio->io_txg);
2019-08-16 02:53:53 +03:00

	kmem_free(dr, sizeof (dbuf_dirty_record_t));
2010-05-29 00:45:14 +04:00
}

static void
dbuf_write_nofill_ready(zio_t *zio)
{
	dbuf_write_ready(zio, NULL, zio->io_private);
}

static void
dbuf_write_nofill_done(zio_t *zio)
{
	dbuf_write_done(zio, NULL, zio->io_private);
}

static void
dbuf_write_override_ready(zio_t *zio)
{
	dbuf_dirty_record_t *dr = zio->io_private;
	dmu_buf_impl_t *db = dr->dr_dbuf;

	dbuf_write_ready(zio, NULL, db);
}

static void
dbuf_write_override_done(zio_t *zio)
{
	dbuf_dirty_record_t *dr = zio->io_private;
	dmu_buf_impl_t *db = dr->dr_dbuf;
	blkptr_t *obp = &dr->dt.dl.dr_overridden_by;

	mutex_enter(&db->db_mtx);
	if (!BP_EQUAL(zio->io_bp, obp)) {
		if (!BP_IS_HOLE(obp))
			dsl_free(spa_get_dsl(zio->io_spa), zio->io_txg, obp);
		arc_release(dr->dt.dl.dr_data, db);
	}
2008-11-20 23:01:55 +03:00
	mutex_exit(&db->db_mtx);

2010-05-29 00:45:14 +04:00
	dbuf_write_done(zio, NULL, db);
2016-07-22 18:52:49 +03:00

	if (zio->io_abd != NULL)
2021-01-20 22:24:37 +03:00
		abd_free(zio->io_abd);
2010-05-29 00:45:14 +04:00
}

OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
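To make the remapping step concrete, here is a minimal sketch of looking
up an offset on a removed vdev in an in-memory table that is sorted by
source offset. The map_entry_t layout and remap_offset() are
hypothetical and much simpler than the real indirect mapping code. They
only illustrate why a lookup in an in-memory table adds little overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied region of the removed vdev. */
typedef struct map_entry {
	uint64_t src_offset;	/* start of the region on the removed vdev */
	uint64_t size;		/* length of the region in bytes */
	uint64_t dst_vdev;	/* vdev that now holds the data */
	uint64_t dst_offset;	/* start of the region on dst_vdev */
} map_entry_t;

/*
 * Binary-search a table sorted by src_offset (regions do not overlap).
 * Returns 0 and fills in the new location, or -1 if the offset falls
 * in a region that was never allocated and so was never copied.
 */
static int
remap_offset(const map_entry_t *map, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const map_entry_t *e = &map[mid];

		if (offset < e->src_offset) {
			hi = mid;
		} else if (offset >= e->src_offset + e->size) {
			lo = mid + 1;
		} else {
			*dst_vdev = e->dst_vdev;
			*dst_offset = e->dst_offset + (offset - e->src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Each lookup touches O(log n) entries in memory and performs no I/O.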
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility: the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL, at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
typedef struct dbuf_remap_impl_callback_arg {
	objset_t	*drica_os;
	uint64_t	drica_blk_birth;
	dmu_tx_t	*drica_tx;
} dbuf_remap_impl_callback_arg_t;

static void
dbuf_remap_impl_callback(uint64_t vdev, uint64_t offset, uint64_t size,
    void *arg)
{
	dbuf_remap_impl_callback_arg_t *drica = arg;
	objset_t *os = drica->drica_os;
	spa_t *spa = dmu_objset_spa(os);
	dmu_tx_t *tx = drica->drica_tx;

	ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));

	if (os == spa_meta_objset(spa)) {
		spa_vdev_indirect_mark_obsolete(spa, vdev, offset, size, tx);
	} else {
		dsl_dataset_block_remapped(dmu_objset_ds(os), vdev, offset,
		    size, drica->drica_blk_birth, tx);
	}
}

static void
2019-07-08 23:18:50 +03:00
dbuf_remap_impl(dnode_t *dn, blkptr_t *bp, krwlock_t *rw, dmu_tx_t *tx)
2016-09-22 19:30:13 +03:00
{
	blkptr_t bp_copy = *bp;
	spa_t *spa = dmu_objset_spa(dn->dn_objset);
	dbuf_remap_impl_callback_arg_t drica;

	ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));

	drica.drica_os = dn->dn_objset;
	drica.drica_blk_birth = bp->blk_birth;
	drica.drica_tx = tx;
	if (spa_remap_blkptr(spa, &bp_copy, dbuf_remap_impl_callback,
	    &drica)) {
2019-07-26 20:54:14 +03:00
		/*
		 * If the blkptr being remapped is tracked by a livelist,
		 * then we need to make sure the livelist reflects the update.
		 * First, cancel out the old blkptr by appending a 'FREE'
		 * entry. Next, add an 'ALLOC' to track the new version. This
		 * way we avoid trying to free an inaccurate blkptr at delete.
		 * Note that embedded blkptrs are not tracked in livelists.
		 */
		if (dn->dn_objset != spa_meta_objset(spa)) {
			dsl_dataset_t *ds = dmu_objset_ds(dn->dn_objset);
			if (dsl_deadlist_is_open(&ds->ds_dir->dd_livelist) &&
			    bp->blk_birth > ds->ds_dir->dd_origin_txg) {
				ASSERT(!BP_IS_EMBEDDED(bp));
				ASSERT(dsl_dir_is_clone(ds->ds_dir));
				ASSERT(spa_feature_is_enabled(spa,
				    SPA_FEATURE_LIVELIST));
				bplist_append(&ds->ds_dir->dd_pending_frees,
				    bp);
				bplist_append(&ds->ds_dir->dd_pending_allocs,
				    &bp_copy);
			}
		}

OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
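The porting note above about zero-sized kmem_alloc() can be illustrated with a small user-space sketch. It assumes malloc()/free() in place of the kernel allocator, and toy_parent_t and toy_compact_children() are hypothetical names for illustration only, not the actual vdev code. The point it shows is that the allocator is never called when no children survive, so the array pointer ends up as a real NULL rather than an allocator-specific sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the parent vdev's child bookkeeping. */
typedef struct {
	void **children;	/* plays the role of vdev_child */
	size_t nchildren;	/* plays the role of vdev_children */
} toy_parent_t;

/*
 * Drop NULL entries from the child array.  Returns 0 on success and
 * -1 if the allocation fails (in which case nothing is changed).
 */
static int
toy_compact_children(toy_parent_t *p)
{
	size_t newc = 0;
	void **newchild = NULL;

	assert(p != NULL);

	for (size_t i = 0; i < p->nchildren; i++) {
		if (p->children[i] != NULL)
			newc++;
	}

	/* Only allocate when there is something to keep. */
	if (newc > 0) {
		size_t j = 0;

		newchild = malloc(newc * sizeof (void *));
		if (newchild == NULL)
			return (-1);
		for (size_t i = 0; i < p->nchildren; i++) {
			if (p->children[i] != NULL)
				newchild[j++] = p->children[i];
		}
	}

	/* free(NULL) is a no-op, so an already-empty parent is fine. */
	free(p->children);
	p->children = newchild;
	p->nchildren = newc;
	return (0);
}
```

Callers can then test the array pointer against NULL, or the count against zero, and get the same answer on any platform.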
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
2019-07-08 23:18:50 +03:00
|
|
|
* The db_rwlock prevents dbuf_read_impl() from
|
2016-09-22 19:30:13 +03:00
|
|
|
* dereferencing the BP while we are changing it. To
|
|
|
|
* avoid lock contention, only grab it when we are actually
|
|
|
|
* changing the BP.
|
|
|
|
*/
|
2019-07-08 23:18:50 +03:00
|
|
|
if (rw != NULL)
|
|
|
|
rw_enter(rw, RW_WRITER);
|
2016-09-22 19:30:13 +03:00
|
|
|
*bp = bp_copy;
|
2019-07-08 23:18:50 +03:00
|
|
|
if (rw != NULL)
|
|
|
|
rw_exit(rw);
|
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remap any existing BP's to concrete vdevs, if possible.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dbuf_remap(dnode_t *dn, dmu_buf_impl_t *db, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
spa_t *spa = dmu_objset_spa(db->db_objset);
|
|
|
|
ASSERT(dsl_pool_sync_context(spa_get_dsl(spa)));
|
|
|
|
|
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (db->db_level > 0) {
|
|
|
|
blkptr_t *bp = db->db.db_data;
|
|
|
|
for (int i = 0; i < db->db.db_size >> SPA_BLKPTRSHIFT; i++) {
|
2019-07-08 23:18:50 +03:00
|
|
|
dbuf_remap_impl(dn, &bp[i], &db->db_rwlock, tx);
|
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
} else if (db->db.db_object == DMU_META_DNODE_OBJECT) {
|
|
|
|
dnode_phys_t *dnp = db->db.db_data;
|
|
|
|
ASSERT3U(db->db_dnode_handle->dnh_dnode->dn_type, ==,
|
|
|
|
DMU_OT_DNODE);
|
|
|
|
for (int i = 0; i < db->db.db_size >> DNODE_SHIFT;
|
|
|
|
i += dnp[i].dn_extra_slots + 1) {
|
|
|
|
for (int j = 0; j < dnp[i].dn_nblkptr; j++) {
|
2019-07-08 23:18:50 +03:00
|
|
|
krwlock_t *lock = (dn->dn_dbuf == NULL ? NULL :
|
|
|
|
&dn->dn_dbuf->db_rwlock);
|
|
|
|
dbuf_remap_impl(dn, &dnp[i].dn_blkptr[j], lock,
|
|
|
|
tx);
|
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
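The meta-dnode branch of dbuf_remap() above advances its loop index by dn_extra_slots + 1, so the slots consumed by a large dnode are never interpreted as dnode headers themselves. A simplified sketch of that stride follows; toy_slot_t and toy_count_blkptrs() are hypothetical and carry only the two fields the walk needs, so this is an illustration of the indexing, not the on-disk dnode_phys_t layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for dnode_phys_t. */
typedef struct {
	uint8_t extra_slots;	/* plays the role of dn_extra_slots */
	uint8_t nblkptr;	/* plays the role of dn_nblkptr */
} toy_slot_t;

/*
 * Sum the block pointer counts of every dnode in a block of nslots
 * slots.  A dnode with extra_slots == n occupies n + 1 consecutive
 * slots; only the first holds a header, so the rest are skipped.
 */
static size_t
toy_count_blkptrs(const toy_slot_t *slots, size_t nslots)
{
	size_t total = 0;

	assert(slots != NULL);

	for (size_t i = 0; i < nslots; i += (size_t)slots[i].extra_slots + 1)
		total += slots[i].nblkptr;

	return (total);
}
```

With a plain i++ stride, the interior slots of a large dnode (which hold bonus data, not a header) would be misread as dnodes.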
|
|
|
|
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Issue I/O to commit a dirty buffer to disk. */
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dbuf_write(dbuf_dirty_record_t *dr, arc_buf_t *data, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dmu_buf_impl_t *db = dr->dr_dbuf;
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
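The per-block figure can be roughly cross-checked from the rounded numbers quoted above. The sketch below divides total CPU percentage by throughput to get a per-unit cost and compares the two runs; cpu_per_mbps() and cpu_cost_ratio() are hypothetical helper names, and CPU per MB/s is used as a proxy for per-block cost on the assumption that the average block size is the same in both runs.

```c
#include <assert.h>

/* CPU percent consumed per MB/s of receive throughput. */
static double
cpu_per_mbps(double cpu_pct, double mbps)
{
	assert(mbps > 0.0);
	return (cpu_pct / mbps);
}

/* Ratio of the new per-unit CPU cost to the baseline per-unit cost. */
static double
cpu_cost_ratio(double base_cpu_pct, double base_mbps,
    double new_cpu_pct, double new_mbps)
{
	return (cpu_per_mbps(new_cpu_pct, new_mbps) /
	    cpu_per_mbps(base_cpu_pct, base_mbps));
}
```

Calling cpu_cost_ratio(40.0 + 35.0, 450.0, 17.0 + 0.0, 670.0) gives a ratio of about 0.15, which is in the neighborhood of the stated 1/7th given that the inputs are rounded. The throughput gain, 670 / 450 - 1, is about 0.49.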
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until its zios have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`); this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn = dr->dr_dnode;
|
2010-08-27 01:24:34 +04:00
|
|
|
objset_t *os;
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_impl_t *parent = db->db_parent;
|
|
|
|
uint64_t txg = tx->tx_txg;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t zb;
|
2010-05-29 00:45:14 +04:00
|
|
|
zio_prop_t zp;
|
2020-02-27 03:09:17 +03:00
|
|
|
zio_t *pio; /* parent I/O */
|
2010-05-29 00:45:14 +04:00
|
|
|
int wp_flag = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-04-21 21:23:37 +03:00
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
os = dn->dn_objset;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (db->db_state != DB_NOFILL) {
|
|
|
|
if (db->db_level > 0 || dn->dn_type == DMU_OT_DNODE) {
|
|
|
|
/*
|
|
|
|
* Private object buffers are released here rather
|
|
|
|
* than in dbuf_dirty() since they are only modified
|
|
|
|
* in the syncing context and we don't want the
|
|
|
|
* overhead of making multiple copies of the data.
|
|
|
|
*/
|
|
|
|
if (BP_IS_HOLE(db->db_blkptr)) {
|
|
|
|
arc_buf_thaw(data);
|
|
|
|
} else {
|
|
|
|
dbuf_release_bp(db);
|
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
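The remapping of reads and frees described above can be pictured as a lookup in a table of copied regions sorted by their old offset. The following is only a minimal sketch under that assumption: the names `remap_entry_t` and `remap_lookup` are hypothetical and are not the OpenZFS indirect mapping API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied region of the removed vdev. */
typedef struct remap_entry {
	uint64_t re_src_offset;	/* start offset on the removed vdev */
	uint64_t re_size;	/* length of the region in bytes */
	uint64_t re_dst_vdev;	/* id of the vdev now holding the data */
	uint64_t re_dst_offset;	/* start offset on that vdev */
} remap_entry_t;

/*
 * Binary-search a table sorted by re_src_offset for the entry covering
 * 'offset'. On success, store the new location and return 0; return -1
 * if no entry covers the offset.
 */
static int
remap_lookup(const remap_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const remap_entry_t *e = &tbl[mid];

		if (offset < e->re_src_offset) {
			hi = mid;
		} else if (offset - e->re_src_offset >= e->re_size) {
			lo = mid + 1;
		} else {
			*dst_vdev = e->re_dst_vdev;
			*dst_offset = e->re_dst_offset +
			    (offset - e->re_src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Keeping the table sorted makes each lookup logarithmic in the number of entries, which fits the statement that an in-memory table adds minimal overhead.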
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
dbuf_remap(dn, db, tx);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (parent != dn->dn_dbuf) {
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Our parent is an indirect block. */
|
|
|
|
/* We have a dirty parent that has been scheduled for write. */
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(parent && parent->db_data_pending);
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Our parent's buffer is one level closer to the dnode. */
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(db->db_level == parent->db_level-1);
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* We're about to modify our parent's db_data by modifying
|
|
|
|
* our block pointer, so the parent must be released.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(arc_released(parent->db_buf));
|
2020-02-27 03:09:17 +03:00
|
|
|
pio = parent->db_data_pending->dr_zio;
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2013-06-11 21:12:34 +04:00
|
|
|
/* Our parent is the dnode itself. */
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT((db->db_level == dn->dn_phys->dn_nlevels-1 &&
|
|
|
|
db->db_blkid != DMU_SPILL_BLKID) ||
|
|
|
|
(db->db_blkid == DMU_SPILL_BLKID && db->db_level == 0));
|
|
|
|
if (db->db_blkid != DMU_SPILL_BLKID)
|
|
|
|
ASSERT3P(db->db_blkptr, ==,
|
|
|
|
&dn->dn_phys->dn_blkptr[db->db_blkid]);
|
2020-02-27 03:09:17 +03:00
|
|
|
pio = dn->dn_zio;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(db->db_level == 0 || data == db->db_buf);
|
|
|
|
ASSERT3U(db->db_blkptr->blk_birth, <=, txg);
|
2020-02-27 03:09:17 +03:00
|
|
|
ASSERT(pio);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
SET_BOOKMARK(&zb, os->os_dsl_dataset ?
|
|
|
|
os->os_dsl_dataset->ds_object : DMU_META_OBJSET,
|
|
|
|
db->db.db_object, db->db_level, db->db_blkid);
|
|
|
|
|
|
|
|
if (db->db_blkid == DMU_SPILL_BLKID)
|
|
|
|
wp_flag = WP_SPILL;
|
|
|
|
wp_flag |= (db->db_state == DB_NOFILL) ? WP_NOFILL : 0;
|
|
|
|
|
2017-03-23 19:07:27 +03:00
|
|
|
dmu_write_policy(os, dn, db->db_level, wp_flag, &zp);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-04-21 21:23:37 +03:00
|
|
|
/*
|
|
|
|
* We copy the blkptr now (rather than when we instantiate the dirty
|
|
|
|
* record), because its value can change between open context and
|
|
|
|
* syncing context. We do not need to hold dn_struct_rwlock to read
|
|
|
|
* db_blkptr because we are in syncing context.
|
|
|
|
*/
|
|
|
|
dr->dr_bp_copy = *db->db_blkptr;
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (db->db_level == 0 &&
|
|
|
|
dr->dt.dl.dr_override_state == DR_OVERRIDDEN) {
|
|
|
|
/*
|
|
|
|
* The BP for this block has been provided by open context
|
|
|
|
* (by dmu_sync() or dmu_buf_write_embedded()).
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *contents = (data != NULL) ?
|
|
|
|
abd_get_from_buf(data->b_data, arc_buf_size(data)) : NULL;
|
2014-06-06 01:19:08 +04:00
|
|
|
|
2020-02-27 03:09:17 +03:00
|
|
|
dr->dr_zio = zio_write(pio, os->os_spa, txg, &dr->dr_bp_copy,
|
|
|
|
contents, db->db.db_size, db->db.db_size, &zp,
|
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smooth the write throttling within a txg. They allow
accounting for the completion of individual physical writes within a
logical one, improving cases where some of the physical writes complete
much sooner than others, gradually opening the write throttle.
A few years after that, ZFS got allocation throttling, which works at
the level of logical writes and limits the number of writes queued to
vdevs at any point, and so limits the latency distribution between
the physical writes, especially writes of multiple copies.
The addition of the scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Memory sizes having grown over
the past 10 years should also reduce the importance of the smoothing.
While the use of the physdone callback may still in theory provide
somewhat smoother throttling, there are cases where we simply cannot
afford it. Since dirty data accounting is protected by a pool-wide
lock, in the case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show a radical reduction of the lock spinning
time on workloads where smaller blocks are written to RAIDZ pools,
with each of the disks receiving 8-16KB chunks but the total rate
reaching 100K+ blocks per second. At the same time, attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, also remove the io_child_count/io_parent_count counters.
They are used only for a couple of assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
|
|
|
dbuf_write_override_ready, NULL,
|
2016-05-15 18:02:28 +03:00
|
|
|
dbuf_write_override_done,
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
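As a rough illustration of a throttle that depends only on the amount of dirty data, the sketch below computes a per-transaction delay that is zero below a threshold and grows steeply as dirty data approaches the maximum. The function name, parameters, and curve shape are assumptions made for this example; the block comment above dmu_tx_delay() remains the authoritative description.

```c
#include <stdint.h>

/*
 * Illustrative delay curve: no delay at or below 'min_dirty', then a
 * delay that grows without bound as 'dirty' approaches 'max_dirty'.
 * Returns the per-transaction delay in nanoseconds, capped at 'cap_ns'.
 * The multiplication is assumed not to overflow 64 bits for the
 * values used here.
 */
static uint64_t
throttle_delay_ns(uint64_t dirty, uint64_t min_dirty, uint64_t max_dirty,
    uint64_t scale_ns, uint64_t cap_ns)
{
	uint64_t delay;

	if (dirty <= min_dirty)
		return (0);
	if (dirty >= max_dirty)
		return (cap_ns);

	delay = scale_ns * (dirty - min_dirty) / (max_dirty - dirty);
	return (delay < cap_ns ? delay : cap_ns);
}
```

Because the delay is a function of the current dirty total alone, every transaction arriving at the same load level sees the same small delay, rather than all of them stalling until the next txg opens.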
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
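The overflow described above can be reproduced by modeling a 32-bit `unsigned long` with `uint32_t`. This is a small sketch only; the helper names are hypothetical, and the 25% figure is the default stated in the note.

```c
#include <stdint.h>

/* Model storing a 64-bit byte count into a 32-bit unsigned long. */
static uint32_t
store_as_ulong32(uint64_t value)
{
	return ((uint32_t)value);	/* high 32 bits are discarded */
}

/* 25% of physical memory, the default described in the porting note. */
static uint64_t
quarter_of_ram(uint64_t physmem_bytes)
{
	return (physmem_bytes / 4);
}
```

With these helpers, 2^32 stored into the 32-bit type becomes 0, while a quarter of 8 GiB (2^31 bytes) still fits. A quarter of a sufficiently large memory size would wrap again, which is why the note says the overflow is not completely avoided.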
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
dr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
|
|
|
|
zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
|
2023-03-10 22:59:53 +03:00
|
|
|
dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite,
|
|
|
|
dr->dt.dl.dr_brtwrite);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
} else if (db->db_state == DB_NOFILL) {
|
2016-06-16 01:47:05 +03:00
|
|
|
ASSERT(zp.zp_checksum == ZIO_CHECKSUM_OFF ||
|
|
|
|
zp.zp_checksum == ZIO_CHECKSUM_NOPARITY);
|
2020-02-27 03:09:17 +03:00
|
|
|
dr->dr_zio = zio_write(pio, os->os_spa, txg,
|
2016-07-11 20:45:52 +03:00
|
|
|
&dr->dr_bp_copy, NULL, db->db.db_size, db->db.db_size, &zp,
|
2023-06-15 20:49:03 +03:00
|
|
|
dbuf_write_nofill_ready, NULL,
|
2016-05-15 18:02:28 +03:00
|
|
|
dbuf_write_nofill_done, db,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZIO_PRIORITY_ASYNC_WRITE,
|
|
|
|
ZIO_FLAG_MUSTSUCCEED | ZIO_FLAG_NODATA, &zb);
|
|
|
|
} else {
|
|
|
|
ASSERT(arc_released(data));
|
2016-05-15 18:02:28 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For indirect blocks, we want to setup the children
|
|
|
|
* ready callback so that we can properly handle an indirect
|
|
|
|
* block that only contains holes.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
arc_write_done_func_t *children_ready_cb = NULL;
|
2016-05-15 18:02:28 +03:00
|
|
|
if (db->db_level != 0)
|
|
|
|
children_ready_cb = dbuf_write_children_ready;
|
|
|
|
|
2020-02-27 03:09:17 +03:00
|
|
|
dr->dr_zio = arc_write(pio, os->os_spa, txg,
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
&dr->dr_bp_copy, data, !DBUF_IS_CACHEABLE(db),
|
|
|
|
dbuf_is_l2cacheable(db), &zp, dbuf_write_ready,
|
2023-06-15 20:49:03 +03:00
|
|
|
children_ready_cb, dbuf_write_done, db,
|
|
|
|
ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2012-08-11 03:28:37 +04:00
|
|
|
EXPORT_SYMBOL(dbuf_find);
|
|
|
|
EXPORT_SYMBOL(dbuf_is_metadata);
|
2016-06-02 07:04:53 +03:00
|
|
|
EXPORT_SYMBOL(dbuf_destroy);
|
2012-08-11 03:28:37 +04:00
|
|
|
EXPORT_SYMBOL(dbuf_loan_arcbuf);
|
|
|
|
EXPORT_SYMBOL(dbuf_whichblock);
|
|
|
|
EXPORT_SYMBOL(dbuf_read);
|
|
|
|
EXPORT_SYMBOL(dbuf_unoverride);
|
|
|
|
EXPORT_SYMBOL(dbuf_free_range);
|
|
|
|
EXPORT_SYMBOL(dbuf_new_size);
|
|
|
|
EXPORT_SYMBOL(dbuf_release_bp);
|
|
|
|
EXPORT_SYMBOL(dbuf_dirty);
|
2018-04-17 21:06:54 +03:00
|
|
|
EXPORT_SYMBOL(dmu_buf_set_crypt_params);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_buf_will_dirty);
|
2019-03-06 20:50:55 +03:00
|
|
|
EXPORT_SYMBOL(dmu_buf_is_dirty);
|
2023-04-30 12:47:09 +03:00
|
|
|
EXPORT_SYMBOL(dmu_buf_will_clone);
|
2012-08-11 03:28:37 +04:00
|
|
|
EXPORT_SYMBOL(dmu_buf_will_not_fill);
|
|
|
|
EXPORT_SYMBOL(dmu_buf_will_fill);
|
|
|
|
EXPORT_SYMBOL(dmu_buf_fill_done);
|
2012-08-14 19:35:32 +04:00
|
|
|
EXPORT_SYMBOL(dmu_buf_rele);
|
2012-08-11 03:28:37 +04:00
|
|
|
EXPORT_SYMBOL(dbuf_assign_arcbuf);
|
|
|
|
EXPORT_SYMBOL(dbuf_prefetch);
|
|
|
|
EXPORT_SYMBOL(dbuf_hold_impl);
|
|
|
|
EXPORT_SYMBOL(dbuf_hold);
|
|
|
|
EXPORT_SYMBOL(dbuf_hold_level);
|
|
|
|
EXPORT_SYMBOL(dbuf_create_bonus);
|
|
|
|
EXPORT_SYMBOL(dbuf_spill_set_blksz);
|
|
|
|
EXPORT_SYMBOL(dbuf_rm_spill);
|
|
|
|
EXPORT_SYMBOL(dbuf_add_ref);
|
|
|
|
EXPORT_SYMBOL(dbuf_rele);
|
|
|
|
EXPORT_SYMBOL(dbuf_rele_and_unlock);
|
|
|
|
EXPORT_SYMBOL(dbuf_refcount);
|
|
|
|
EXPORT_SYMBOL(dbuf_sync_list);
|
|
|
|
EXPORT_SYMBOL(dmu_buf_set_user);
|
|
|
|
EXPORT_SYMBOL(dmu_buf_set_user_ie);
|
|
|
|
EXPORT_SYMBOL(dmu_buf_get_user);
|
2014-11-13 21:09:05 +03:00
|
|
|
EXPORT_SYMBOL(dmu_buf_get_blkptr);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesirable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
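To see why the mismatch matters on big-endian 32-bit systems, the sketch below simulates the memory of a 64-bit variable in big-endian byte order and performs a 32-bit store at its starting address. The helper names are hypothetical, and the byte layout is simulated explicitly so the result does not depend on the host's endianness.

```c
#include <stdint.h>
#include <string.h>

/* Decode 8 bytes laid out in big-endian order into a 64-bit value. */
static uint64_t
be64_decode(const uint8_t b[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return (v);
}

/*
 * Simulate a 32-bit store to the start of a zeroed 64-bit big-endian
 * variable, as happens when a uint64_t is treated as a 32-bit ulong.
 */
static uint64_t
be_store32_into_u64(uint32_t value)
{
	uint8_t mem[8];

	memset(mem, 0, sizeof (mem));	/* variable starts out as zero */
	mem[0] = (uint8_t)(value >> 24);
	mem[1] = (uint8_t)(value >> 16);
	mem[2] = (uint8_t)(value >> 8);
	mem[3] = (uint8_t)value;
	return (be64_decode(mem));
}
```

Storing 12 this way lands in the high-order half, so the variable reads back as 12 shifted left by 32 bits rather than 12.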
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf_cache, dbuf_cache_, max_bytes, U64, ZMOD_RW,
|
2016-12-12 21:46:26 +03:00
|
|
|
"Maximum size in bytes of the dbuf cache.");
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf_cache, dbuf_cache_, hiwater_pct, UINT, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Percentage over dbuf_cache_max_bytes for direct dbuf eviction.");
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf_cache, dbuf_cache_, lowater_pct, UINT, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Percentage below dbuf_cache_max_bytes when dbuf eviction stops.");
|
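The two percentage parameters above are described relative to dbuf_cache_max_bytes. One plausible reading of those descriptions is sketched below; the function names are hypothetical and the arithmetic is an assumption for illustration, not a copy of the module's code.

```c
#include <stdint.h>

/* Size above which eviction is performed directly (assumed formula). */
static uint64_t
cache_hiwater_bytes(uint64_t max_bytes, unsigned int hiwater_pct)
{
	return (max_bytes + (max_bytes * hiwater_pct) / 100);
}

/* Size below which eviction stops (assumed formula). */
static uint64_t
cache_lowater_bytes(uint64_t max_bytes, unsigned int lowater_pct)
{
	return (max_bytes - (max_bytes * lowater_pct) / 100);
}
```

Under this reading, a 1000-byte maximum with both percentages set to 10 gives a high-water mark of 1100 bytes and a low-water mark of 900 bytes.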
2016-06-02 07:04:53 +03:00
|
|
|
|
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf, dbuf_, metadata_cache_max_bytes, U64, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Maximum size in bytes of dbuf metadata cache.");
|
2018-07-10 20:49:50 +03:00
|
|
|
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters were found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf, dbuf_, cache_shift, UINT, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Set size of dbuf cache to log2 fraction of arc size.");
|
2018-07-10 20:49:50 +03:00
|
|
|
|
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf, dbuf_, metadata_cache_shift, UINT, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Set size of dbuf metadata cache to log2 fraction of arc size.");
|
2022-09-19 22:17:11 +03:00
|
|
|
|
|
|
|
ZFS_MODULE_PARAM(zfs_dbuf, dbuf_, mutex_cache_shift, UINT, ZMOD_RD,
|
|
|
|
"Set size of dbuf cache mutex array as log2 shift.");
|