2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
|
|
|
* or http://www.opensolaris.org/os/licensing.
|
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
2012-12-14 03:24:15 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-06 00:34:27 +03:00
|
|
|
* Copyright (c) 2011, 2019 by Delphix. All rights reserved.
|
2017-04-13 19:40:56 +03:00
|
|
|
* Copyright (c) 2014 Integros [integros.com]
|
2017-02-04 01:18:28 +03:00
|
|
|
* Copyright 2016 Nexenta Systems, Inc.
|
2018-08-20 20:05:23 +03:00
|
|
|
* Copyright (c) 2017, 2018 Lawrence Livermore National Security, LLC.
|
2017-04-13 19:40:56 +03:00
|
|
|
* Copyright (c) 2015, 2017, Intel Corporation.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <stdio.h>
|
2013-03-25 01:24:51 +04:00
|
|
|
#include <unistd.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <stdio_ext.h>
|
|
|
|
#include <stdlib.h>
|
|
|
|
#include <ctype.h>
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/spa_impl.h>
|
|
|
|
#include <sys/dmu.h>
|
|
|
|
#include <sys/zap.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
|
|
|
#include <sys/zfs_znode.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/zfs_sa.h>
|
|
|
|
#include <sys/sa.h>
|
|
|
|
#include <sys/sa_impl.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/vdev.h>
|
|
|
|
#include <sys/vdev_impl.h>
|
|
|
|
#include <sys/metaslab_impl.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
|
|
|
#include <sys/dsl_dir.h>
|
|
|
|
#include <sys/dsl_dataset.h>
|
|
|
|
#include <sys/dsl_pool.h>
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
#include <sys/dsl_bookmark.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/dbuf.h>
|
|
|
|
#include <sys/zil.h>
|
|
|
|
#include <sys/zil_impl.h>
|
|
|
|
#include <sys/stat.h>
|
|
|
|
#include <sys/resource.h>
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
#include <sys/dmu_send.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/dmu_traverse.h>
|
|
|
|
#include <sys/zio_checksum.h>
|
|
|
|
#include <sys/zio_compress.h>
|
|
|
|
#include <sys/zfs_fuid.h>
|
2008-12-03 23:09:06 +03:00
|
|
|
#include <sys/arc.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/ddt.h>
|
2012-12-14 03:24:15 +04:00
|
|
|
#include <sys/zfeature.h>
|
2016-07-22 18:52:49 +03:00
|
|
|
#include <sys/abd.h>
|
2017-05-01 21:06:07 +03:00
|
|
|
#include <sys/blkptr.h>
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
#include <sys/dsl_crypt.h>
|
2018-10-12 21:28:26 +03:00
|
|
|
#include <sys/dsl_scan.h>
|
2013-08-28 15:45:09 +04:00
|
|
|
#include <zfs_comutil.h>
|
2018-11-05 22:22:33 +03:00
|
|
|
|
|
|
|
#include <libnvpair.h>
|
|
|
|
#include <libzutil.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
#include "zdb.h"
|
|
|
|
|
2013-01-12 04:42:50 +04:00
|
|
|
#define ZDB_COMPRESS_NAME(idx) ((idx) < ZIO_COMPRESS_FUNCTIONS ? \
|
|
|
|
zio_compress_table[(idx)].ci_name : "UNKNOWN")
|
|
|
|
#define ZDB_CHECKSUM_NAME(idx) ((idx) < ZIO_CHECKSUM_FUNCTIONS ? \
|
|
|
|
zio_checksum_table[(idx)].ci_name : "UNKNOWN")
|
|
|
|
#define ZDB_OT_TYPE(idx) ((idx) < DMU_OT_NUMTYPES ? (idx) : \
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
(idx) == DMU_OTN_ZAP_DATA || (idx) == DMU_OTN_ZAP_METADATA ? \
|
|
|
|
DMU_OT_ZAP_OTHER : \
|
|
|
|
(idx) == DMU_OTN_UINT64_DATA || (idx) == DMU_OTN_UINT64_METADATA ? \
|
|
|
|
DMU_OT_UINT64_OTHER : DMU_OT_NUMTYPES)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-08-01 20:42:04 +03:00
|
|
|
static char *
|
|
|
|
zdb_ot_name(dmu_object_type_t type)
|
|
|
|
{
|
|
|
|
if (type < DMU_OT_NUMTYPES)
|
|
|
|
return (dmu_ot[type].ot_name);
|
|
|
|
else if ((type & DMU_OT_NEWTYPE) &&
|
2016-12-12 21:46:26 +03:00
|
|
|
((type & DMU_OT_BYTESWAP_MASK) < DMU_BSWAP_NUMFUNCS))
|
2016-08-01 20:42:04 +03:00
|
|
|
return (dmu_ot_byteswap[type & DMU_OT_BYTESWAP_MASK].ob_name);
|
|
|
|
else
|
|
|
|
return ("UNKNOWN");
|
|
|
|
}
|
|
|
|
|
2017-02-01 01:36:35 +03:00
|
|
|
extern int reference_tracking_enable;
|
2010-05-29 00:45:14 +04:00
|
|
|
extern int zfs_recover;
|
2014-09-17 00:24:48 +04:00
|
|
|
extern uint64_t zfs_arc_max, zfs_arc_meta_limit;
|
2015-05-15 02:41:29 +03:00
|
|
|
extern int zfs_vdev_async_read_max_active;
|
2018-01-31 02:25:19 +03:00
|
|
|
extern boolean_t spa_load_verify_dryrun;
|
2019-01-17 01:10:02 +03:00
|
|
|
extern int zfs_reconstruct_indirect_combinations_max;
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
extern int zfs_btree_verify_intensity;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char cmdname[] = "zdb";
|
2008-11-20 23:01:55 +03:00
|
|
|
uint8_t dump_opt[256];
|
|
|
|
|
|
|
|
typedef void object_viewer_t(objset_t *, uint64_t, void *data, size_t size);
|
|
|
|
|
|
|
|
uint64_t *zopt_object = NULL;
|
2017-10-27 22:46:35 +03:00
|
|
|
static unsigned zopt_objects = 0;
|
2019-08-13 17:11:57 +03:00
|
|
|
uint64_t max_inflight_bytes = 256 * 1024 * 1024; /* 256MB */
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
static int leaked_objects = 0;
|
2018-10-12 21:28:26 +03:00
|
|
|
static range_tree_t *mos_refd_objs;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
static void snprintf_blkptr_compact(char *, size_t, const blkptr_t *,
|
|
|
|
boolean_t);
|
2018-10-12 21:28:26 +03:00
|
|
|
static void mos_obj_refd(uint64_t);
|
|
|
|
static void mos_obj_refd_multiple(uint64_t);
|
2015-04-27 01:27:36 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* These libumem hooks provide a reasonable set of defaults for the allocator's
|
|
|
|
* debugging facilities.
|
|
|
|
*/
|
|
|
|
const char *
|
2010-08-26 20:52:41 +04:00
|
|
|
_umem_debug_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
return ("default,verbose"); /* $UMEM_DEBUG setting */
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *
|
|
|
|
_umem_logging_init(void)
|
|
|
|
{
|
|
|
|
return ("fail,contents"); /* $UMEM_LOGGING setting */
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
usage(void)
|
|
|
|
{
|
|
|
|
(void) fprintf(stderr,
|
2016-12-17 01:11:29 +03:00
|
|
|
"Usage:\t%s [-AbcdDFGhikLMPsvX] [-e [-V] [-p <path> ...]] "
|
2017-04-13 19:40:56 +03:00
|
|
|
"[-I <inflight I/Os>]\n"
|
|
|
|
"\t\t[-o <var>=<value>]... [-t <txg>] [-U <cache>] [-x <dumpdir>]\n"
|
|
|
|
"\t\t[<poolname> [<object> ...]]\n"
|
2018-01-30 02:05:03 +03:00
|
|
|
"\t%s [-AdiPv] [-e [-V] [-p <path> ...]] [-U <cache>] <dataset>\n"
|
|
|
|
"\t\t[<object> ...]\n"
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
"\t%s [-v] <bookmark>\n"
|
2017-04-13 19:40:56 +03:00
|
|
|
"\t%s -C [-A] [-U <cache>]\n"
|
|
|
|
"\t%s -l [-Aqu] <device>\n"
|
2017-04-14 00:28:46 +03:00
|
|
|
"\t%s -m [-AFLPX] [-e [-V] [-p <path> ...]] [-t <txg>] "
|
|
|
|
"[-U <cache>]\n\t\t<poolname> [<vdev> [<metaslab> ...]]\n"
|
2017-04-13 19:40:56 +03:00
|
|
|
"\t%s -O <dataset> <path>\n"
|
2017-04-14 00:28:46 +03:00
|
|
|
"\t%s -R [-A] [-e [-V] [-p <path> ...]] [-U <cache>]\n"
|
2017-04-13 19:40:56 +03:00
|
|
|
"\t\t<poolname> <vdev>:<offset>:<size>[:<flags>]\n"
|
2017-05-01 21:06:07 +03:00
|
|
|
"\t%s -E [-A] word0:word1:...:word15\n"
|
2017-04-14 00:28:46 +03:00
|
|
|
"\t%s -S [-AP] [-e [-V] [-p <path> ...]] [-U <cache>] "
|
|
|
|
"<poolname>\n\n",
|
2017-04-13 19:40:56 +03:00
|
|
|
cmdname, cmdname, cmdname, cmdname, cmdname, cmdname, cmdname,
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
cmdname, cmdname, cmdname);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) fprintf(stderr, " Dataset name must include at least one "
|
|
|
|
"separator character '/' or '@'\n");
|
|
|
|
(void) fprintf(stderr, " If dataset name is specified, only that "
|
|
|
|
"dataset is dumped\n");
|
|
|
|
(void) fprintf(stderr, " If object numbers are specified, only "
|
|
|
|
"those objects are dumped\n\n");
|
|
|
|
(void) fprintf(stderr, " Options to control amount of output:\n");
|
|
|
|
(void) fprintf(stderr, " -b block statistics\n");
|
|
|
|
(void) fprintf(stderr, " -c checksum all metadata (twice for "
|
2009-07-03 02:44:48 +04:00
|
|
|
"all data) blocks\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -C config (or cachefile if alone)\n");
|
|
|
|
(void) fprintf(stderr, " -d dataset(s)\n");
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, " -D dedup statistics\n");
|
2017-05-01 21:06:07 +03:00
|
|
|
(void) fprintf(stderr, " -E decode and display block from an "
|
|
|
|
"embedded block pointer\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -h pool history\n");
|
|
|
|
(void) fprintf(stderr, " -i intent logs\n");
|
2017-02-04 01:18:28 +03:00
|
|
|
(void) fprintf(stderr, " -l read label contents\n");
|
2016-12-17 01:11:29 +03:00
|
|
|
(void) fprintf(stderr, " -k examine the checkpointed state "
|
|
|
|
"of the pool\n");
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) fprintf(stderr, " -L disable leak tracking (do not "
|
|
|
|
"load spacemaps)\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -m metaslabs\n");
|
|
|
|
(void) fprintf(stderr, " -M metaslab groups\n");
|
|
|
|
(void) fprintf(stderr, " -O perform object lookups by path\n");
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) fprintf(stderr, " -R read and display block from a "
|
2017-04-13 19:40:56 +03:00
|
|
|
"device\n");
|
|
|
|
(void) fprintf(stderr, " -s report stats on zdb's I/O\n");
|
|
|
|
(void) fprintf(stderr, " -S simulate dedup to measure effect\n");
|
|
|
|
(void) fprintf(stderr, " -v verbose (applies to all "
|
|
|
|
"others)\n\n");
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, " Below options are intended for use "
|
2016-01-01 16:42:58 +03:00
|
|
|
"with other options:\n");
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, " -A ignore assertions (-A), enable "
|
|
|
|
"panic recovery (-AA) or both (-AAA)\n");
|
|
|
|
(void) fprintf(stderr, " -e pool is exported/destroyed/"
|
|
|
|
"has altroot/not in a cachefile\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -F attempt automatic rewind within "
|
|
|
|
"safe range of transaction groups\n");
|
2017-01-28 23:16:43 +03:00
|
|
|
(void) fprintf(stderr, " -G dump zfs_dbgmsg buffer before "
|
|
|
|
"exiting\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -I <number of inflight I/Os> -- "
|
|
|
|
"specify the maximum number of\n "
|
|
|
|
"checksumming I/Os [default is 200]\n");
|
2017-01-31 21:13:10 +03:00
|
|
|
(void) fprintf(stderr, " -o <variable>=<value> set global "
|
2017-04-13 19:40:56 +03:00
|
|
|
"variable to an unsigned 32-bit integer\n");
|
|
|
|
(void) fprintf(stderr, " -p <path> -- use one or more with "
|
|
|
|
"-e to specify path to vdev dir\n");
|
|
|
|
(void) fprintf(stderr, " -P print numbers in parseable form\n");
|
2017-02-04 01:18:28 +03:00
|
|
|
(void) fprintf(stderr, " -q don't print label contents\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -t <txg> -- highest txg to use when "
|
|
|
|
"searching for uberblocks\n");
|
2017-02-04 01:18:28 +03:00
|
|
|
(void) fprintf(stderr, " -u uberblock\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -U <cachefile_path> -- use alternate "
|
|
|
|
"cachefile\n");
|
2017-04-14 00:28:46 +03:00
|
|
|
(void) fprintf(stderr, " -V do verbatim import\n");
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) fprintf(stderr, " -x <dumpdir> -- "
|
|
|
|
"dump all read blocks into specified directory\n");
|
|
|
|
(void) fprintf(stderr, " -X attempt extreme rewind (does not "
|
|
|
|
"work with dataset)\n");
|
2019-01-17 01:10:02 +03:00
|
|
|
(void) fprintf(stderr, " -Y attempt all reconstruction "
|
|
|
|
"combinations for split blocks\n");
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) fprintf(stderr, "Specify an option more than once (e.g. -bb) "
|
|
|
|
"to make only that option verbose\n");
|
|
|
|
(void) fprintf(stderr, "Default is to dump everything non-verbosely\n");
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
static void
|
|
|
|
dump_debug_buffer(void)
|
|
|
|
{
|
|
|
|
if (dump_opt['G']) {
|
|
|
|
(void) printf("\n");
|
2018-10-15 22:14:22 +03:00
|
|
|
(void) fflush(stdout);
|
2017-01-28 23:16:43 +03:00
|
|
|
zfs_dbgmsg_print("zdb");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Called for usage errors that are discovered after a call to spa_open(),
|
|
|
|
* dmu_bonus_hold(), or pool_match(). abort() is called for other errors.
|
|
|
|
*/
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
fatal(const char *fmt, ...)
|
|
|
|
{
|
|
|
|
va_list ap;
|
|
|
|
|
|
|
|
va_start(ap, fmt);
|
|
|
|
(void) fprintf(stderr, "%s: ", cmdname);
|
|
|
|
(void) vfprintf(stderr, fmt, ap);
|
|
|
|
va_end(ap);
|
|
|
|
(void) fprintf(stderr, "\n");
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
dump_debug_buffer();
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
exit(1);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
dump_packed_nvlist(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
nvlist_t *nv;
|
|
|
|
size_t nvsize = *(uint64_t *)data;
|
|
|
|
char *packed = umem_alloc(nvsize, UMEM_NOFAIL);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
VERIFY(0 == dmu_read(os, object, 0, nvsize, packed, DMU_READ_PREFETCH));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
VERIFY(nvlist_unpack(packed, nvsize, &nv, 0) == 0);
|
|
|
|
|
|
|
|
umem_free(packed, nvsize);
|
|
|
|
|
|
|
|
dump_nvlist(nv, 8);
|
|
|
|
|
|
|
|
nvlist_free(nv);
|
|
|
|
}
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
dump_history_offsets(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
spa_history_phys_t *shp = data;
|
|
|
|
|
|
|
|
if (shp == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("\t\tpool_create_len = %llu\n",
|
|
|
|
(u_longlong_t)shp->sh_pool_create_len);
|
|
|
|
(void) printf("\t\tphys_max_off = %llu\n",
|
|
|
|
(u_longlong_t)shp->sh_phys_max_off);
|
|
|
|
(void) printf("\t\tbof = %llu\n",
|
|
|
|
(u_longlong_t)shp->sh_bof);
|
|
|
|
(void) printf("\t\teof = %llu\n",
|
|
|
|
(u_longlong_t)shp->sh_eof);
|
|
|
|
(void) printf("\t\trecords_lost = %llu\n",
|
|
|
|
(u_longlong_t)shp->sh_records_lost);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(uint64_t num, char *buf, size_t buflen)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
if (dump_opt['P'])
|
2017-06-13 12:16:45 +03:00
|
|
|
(void) snprintf(buf, buflen, "%llu", (longlong_t)num);
|
2010-05-29 00:45:14 +04:00
|
|
|
else
|
2017-06-13 12:16:45 +03:00
|
|
|
nicenum(num, buf, sizeof (buf));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char histo_stars[] = "****************************************";
|
|
|
|
static const uint64_t histo_width = sizeof (histo_stars) - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(const uint64_t *histo, int size, int offset)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int i;
|
2013-03-25 01:24:51 +04:00
|
|
|
int minidx = size - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
int maxidx = 0;
|
|
|
|
uint64_t max = 0;
|
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
for (i = 0; i < size; i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (histo[i] > max)
|
|
|
|
max = histo[i];
|
|
|
|
if (histo[i] > 0 && i > maxidx)
|
|
|
|
maxidx = i;
|
|
|
|
if (histo[i] > 0 && i < minidx)
|
|
|
|
minidx = i;
|
|
|
|
}
|
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (max < histo_width)
|
|
|
|
max = histo_width;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
for (i = minidx; i <= maxidx; i++) {
|
|
|
|
(void) printf("\t\t\t%3u: %6llu %s\n",
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
i + offset, (u_longlong_t)histo[i],
|
2013-03-25 01:24:51 +04:00
|
|
|
&histo_stars[(max - histo[i]) * histo_width / max]);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_zap_stats(objset_t *os, uint64_t object)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
zap_stats_t zs;
|
|
|
|
|
|
|
|
error = zap_get_stats(os, object, &zs);
|
|
|
|
if (error)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (zs.zs_ptrtbl_len == 0) {
|
|
|
|
ASSERT(zs.zs_num_blocks == 1);
|
|
|
|
(void) printf("\tmicrozap: %llu bytes, %llu entries\n",
|
|
|
|
(u_longlong_t)zs.zs_blocksize,
|
|
|
|
(u_longlong_t)zs.zs_num_entries);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\tFat ZAP stats:\n");
|
|
|
|
|
|
|
|
(void) printf("\t\tPointer table:\n");
|
|
|
|
(void) printf("\t\t\t%llu elements\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_len);
|
|
|
|
(void) printf("\t\t\tzt_blk: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_zt_blk);
|
|
|
|
(void) printf("\t\t\tzt_numblks: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_zt_numblks);
|
|
|
|
(void) printf("\t\t\tzt_shift: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_zt_shift);
|
|
|
|
(void) printf("\t\t\tzt_blks_copied: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_blks_copied);
|
|
|
|
(void) printf("\t\t\tzt_nextblk: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_ptrtbl_nextblk);
|
|
|
|
|
|
|
|
(void) printf("\t\tZAP entries: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_num_entries);
|
|
|
|
(void) printf("\t\tLeaf blocks: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_num_leafs);
|
|
|
|
(void) printf("\t\tTotal blocks: %llu\n",
|
|
|
|
(u_longlong_t)zs.zs_num_blocks);
|
|
|
|
(void) printf("\t\tzap_block_type: 0x%llx\n",
|
|
|
|
(u_longlong_t)zs.zs_block_type);
|
|
|
|
(void) printf("\t\tzap_magic: 0x%llx\n",
|
|
|
|
(u_longlong_t)zs.zs_magic);
|
|
|
|
(void) printf("\t\tzap_salt: 0x%llx\n",
|
|
|
|
(u_longlong_t)zs.zs_salt);
|
|
|
|
|
|
|
|
(void) printf("\t\tLeafs with 2^n pointers:\n");
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(zs.zs_leafs_with_2n_pointers, ZAP_HISTOGRAM_SIZE, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\t\tBlocks with n*5 entries:\n");
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(zs.zs_blocks_with_n5_entries, ZAP_HISTOGRAM_SIZE, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\t\tBlocks n/10 full:\n");
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(zs.zs_blocks_n_tenths_full, ZAP_HISTOGRAM_SIZE, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\t\tEntries with n chunks:\n");
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(zs.zs_entries_using_n_chunks, ZAP_HISTOGRAM_SIZE, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\t\tBuckets with n entries:\n");
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(zs.zs_buckets_with_n_entries, ZAP_HISTOGRAM_SIZE, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_none(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_unknown(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
(void) printf("\tUNKNOWN OBJECT TYPE\n");
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
2017-10-27 22:46:35 +03:00
|
|
|
static void
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_uint8(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_uint64(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
uint64_t *arr;
|
2019-06-25 22:50:38 +03:00
|
|
|
uint64_t oursize;
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (dump_opt['d'] < 6)
|
|
|
|
return;
|
2019-06-25 22:50:38 +03:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (data == NULL) {
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
|
|
|
|
VERIFY0(dmu_object_info(os, object, &doi));
|
|
|
|
size = doi.doi_max_offset;
|
2019-06-25 22:50:38 +03:00
|
|
|
/*
|
|
|
|
* We cap the size at 1 mebibyte here to prevent
|
|
|
|
* allocation failures and nigh-infinite printing if the
|
|
|
|
* object is extremely large.
|
|
|
|
*/
|
|
|
|
oursize = MIN(size, 1 << 20);
|
|
|
|
arr = kmem_alloc(oursize, KM_SLEEP);
|
|
|
|
|
|
|
|
int err = dmu_read(os, object, 0, oursize, arr, 0);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (err != 0) {
|
|
|
|
(void) printf("got error %u from dmu_read\n", err);
|
2019-06-25 22:50:38 +03:00
|
|
|
kmem_free(arr, oursize);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
} else {
|
2019-06-25 22:50:38 +03:00
|
|
|
/*
|
|
|
|
* Even though the allocation is already done in this code path,
|
|
|
|
* we still cap the size to prevent excessive printing.
|
|
|
|
*/
|
|
|
|
oursize = MIN(size, 1 << 20);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
arr = data;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (size == 0) {
|
|
|
|
(void) printf("\t\t[]\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\t\t[%0llx", (u_longlong_t)arr[0]);
|
2019-06-25 22:50:38 +03:00
|
|
|
for (size_t i = 1; i * sizeof (uint64_t) < oursize; i++) {
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (i % 4 != 0)
|
|
|
|
(void) printf(", %0llx", (u_longlong_t)arr[i]);
|
|
|
|
else
|
|
|
|
(void) printf(",\n\t\t%0llx", (u_longlong_t)arr[i]);
|
|
|
|
}
|
2019-06-25 22:50:38 +03:00
|
|
|
if (oursize != size)
|
|
|
|
(void) printf(", ... ");
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
(void) printf("]\n");
|
|
|
|
|
|
|
|
if (data == NULL)
|
2019-06-25 22:50:38 +03:00
|
|
|
kmem_free(arr, oursize);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_zap(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
void *prop;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dump_zap_stats(os, object);
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, os, object);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
(void) printf("\t\t%s = ", attr.za_name);
|
|
|
|
if (attr.za_num_integers == 0) {
|
|
|
|
(void) printf("\n");
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
prop = umem_zalloc(attr.za_num_integers *
|
|
|
|
attr.za_integer_length, UMEM_NOFAIL);
|
|
|
|
(void) zap_lookup(os, object, attr.za_name,
|
|
|
|
attr.za_integer_length, attr.za_num_integers, prop);
|
|
|
|
if (attr.za_integer_length == 1) {
|
|
|
|
(void) printf("%s", (char *)prop);
|
|
|
|
} else {
|
|
|
|
for (i = 0; i < attr.za_num_integers; i++) {
|
|
|
|
switch (attr.za_integer_length) {
|
|
|
|
case 2:
|
|
|
|
(void) printf("%u ",
|
|
|
|
((uint16_t *)prop)[i]);
|
|
|
|
break;
|
|
|
|
case 4:
|
|
|
|
(void) printf("%u ",
|
|
|
|
((uint32_t *)prop)[i]);
|
|
|
|
break;
|
|
|
|
case 8:
|
|
|
|
(void) printf("%lld ",
|
|
|
|
(u_longlong_t)((int64_t *)prop)[i]);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
umem_free(prop, attr.za_num_integers * attr.za_integer_length);
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
2015-04-27 01:27:36 +03:00
|
|
|
static void
|
|
|
|
dump_bpobj(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
bpobj_phys_t *bpop = data;
|
|
|
|
uint64_t i;
|
|
|
|
char bytes[32], comp[32], uncomp[32];
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure the output won't get truncated */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);
|
|
|
|
|
2015-04-27 01:27:36 +03:00
|
|
|
if (bpop == NULL)
|
|
|
|
return;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bpop->bpo_bytes, bytes, sizeof (bytes));
|
|
|
|
zdb_nicenum(bpop->bpo_comp, comp, sizeof (comp));
|
|
|
|
zdb_nicenum(bpop->bpo_uncomp, uncomp, sizeof (uncomp));
|
2015-04-27 01:27:36 +03:00
|
|
|
|
|
|
|
(void) printf("\t\tnum_blkptrs = %llu\n",
|
|
|
|
(u_longlong_t)bpop->bpo_num_blkptrs);
|
|
|
|
(void) printf("\t\tbytes = %s\n", bytes);
|
|
|
|
if (size >= BPOBJ_SIZE_V1) {
|
|
|
|
(void) printf("\t\tcomp = %s\n", comp);
|
|
|
|
(void) printf("\t\tuncomp = %s\n", uncomp);
|
|
|
|
}
|
2019-07-26 20:54:14 +03:00
|
|
|
if (size >= BPOBJ_SIZE_V2) {
|
2015-04-27 01:27:36 +03:00
|
|
|
(void) printf("\t\tsubobjs = %llu\n",
|
|
|
|
(u_longlong_t)bpop->bpo_subobjs);
|
|
|
|
(void) printf("\t\tnum_subobjs = %llu\n",
|
|
|
|
(u_longlong_t)bpop->bpo_num_subobjs);
|
|
|
|
}
|
2019-07-26 20:54:14 +03:00
|
|
|
if (size >= sizeof (*bpop)) {
|
|
|
|
(void) printf("\t\tnum_freed = %llu\n",
|
|
|
|
(u_longlong_t)bpop->bpo_num_freed);
|
|
|
|
}
|
2015-04-27 01:27:36 +03:00
|
|
|
|
|
|
|
if (dump_opt['d'] < 5)
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (i = 0; i < bpop->bpo_num_blkptrs; i++) {
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
blkptr_t bp;
|
|
|
|
|
|
|
|
int err = dmu_read(os, object,
|
|
|
|
i * sizeof (bp), sizeof (bp), &bp, 0);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) printf("got error %u from dmu_read\n", err);
|
|
|
|
break;
|
|
|
|
}
|
2019-07-26 20:54:14 +03:00
|
|
|
snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), &bp,
|
|
|
|
BP_GET_FREE(&bp));
|
2015-04-27 01:27:36 +03:00
|
|
|
(void) printf("\t%s\n", blkbuf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
dump_bpobj_subobjs(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dmu_object_info_t doi;
|
2015-10-09 21:28:12 +03:00
|
|
|
int64_t i;
|
2015-04-27 01:27:36 +03:00
|
|
|
|
|
|
|
VERIFY0(dmu_object_info(os, object, &doi));
|
|
|
|
uint64_t *subobjs = kmem_alloc(doi.doi_max_offset, KM_SLEEP);
|
|
|
|
|
|
|
|
int err = dmu_read(os, object, 0, doi.doi_max_offset, subobjs, 0);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) printf("got error %u from dmu_read\n", err);
|
|
|
|
kmem_free(subobjs, doi.doi_max_offset);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
int64_t last_nonzero = -1;
|
|
|
|
for (i = 0; i < doi.doi_max_offset / 8; i++) {
|
|
|
|
if (subobjs[i] != 0)
|
|
|
|
last_nonzero = i;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i <= last_nonzero; i++) {
|
2015-10-09 21:28:12 +03:00
|
|
|
(void) printf("\t%llu\n", (u_longlong_t)subobjs[i]);
|
2015-04-27 01:27:36 +03:00
|
|
|
}
|
|
|
|
kmem_free(subobjs, doi.doi_max_offset);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_ddt_zap(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dump_zap_stats(os, object);
|
|
|
|
/* contents are printed elsewhere, properly decoded */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_sa_attrs(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
|
|
|
|
dump_zap_stats(os, object);
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, os, object);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
(void) printf("\t\t%s = ", attr.za_name);
|
|
|
|
if (attr.za_num_integers == 0) {
|
|
|
|
(void) printf("\n");
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
(void) printf(" %llx : [%d:%d:%d]\n",
|
|
|
|
(u_longlong_t)attr.za_first_integer,
|
|
|
|
(int)ATTR_LENGTH(attr.za_first_integer),
|
|
|
|
(int)ATTR_BSWAP(attr.za_first_integer),
|
|
|
|
(int)ATTR_NUM(attr.za_first_integer));
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_sa_layouts(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
uint16_t *layout_attrs;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
dump_zap_stats(os, object);
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, os, object);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
(void) printf("\t\t%s = [", attr.za_name);
|
|
|
|
if (attr.za_num_integers == 0) {
|
|
|
|
(void) printf("\n");
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
VERIFY(attr.za_integer_length == 2);
|
|
|
|
layout_attrs = umem_zalloc(attr.za_num_integers *
|
|
|
|
attr.za_integer_length, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
VERIFY(zap_lookup(os, object, attr.za_name,
|
|
|
|
attr.za_integer_length,
|
|
|
|
attr.za_num_integers, layout_attrs) == 0);
|
|
|
|
|
|
|
|
for (i = 0; i != attr.za_num_integers; i++)
|
|
|
|
(void) printf(" %d ", (int)layout_attrs[i]);
|
|
|
|
(void) printf("]\n");
|
|
|
|
umem_free(layout_attrs,
|
|
|
|
attr.za_num_integers * attr.za_integer_length);
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_zpldir(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
const char *typenames[] = {
|
|
|
|
/* 0 */ "not specified",
|
|
|
|
/* 1 */ "FIFO",
|
|
|
|
/* 2 */ "Character Device",
|
|
|
|
/* 3 */ "3 (invalid)",
|
|
|
|
/* 4 */ "Directory",
|
|
|
|
/* 5 */ "5 (invalid)",
|
|
|
|
/* 6 */ "Block Device",
|
|
|
|
/* 7 */ "7 (invalid)",
|
|
|
|
/* 8 */ "Regular File",
|
|
|
|
/* 9 */ "9 (invalid)",
|
|
|
|
/* 10 */ "Symbolic Link",
|
|
|
|
/* 11 */ "11 (invalid)",
|
|
|
|
/* 12 */ "Socket",
|
|
|
|
/* 13 */ "Door",
|
|
|
|
/* 14 */ "Event Port",
|
|
|
|
/* 15 */ "15 (invalid)",
|
|
|
|
};
|
|
|
|
|
|
|
|
dump_zap_stats(os, object);
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, os, object);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
(void) printf("\t\t%s = %lld (type: %s)\n",
|
|
|
|
attr.za_name, ZFS_DIRENT_OBJ(attr.za_first_integer),
|
|
|
|
typenames[ZFS_DIRENT_TYPE(attr.za_first_integer)]);
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static int
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
get_dtl_refcount(vdev_t *vd)
|
|
|
|
{
|
|
|
|
int refcount = 0;
|
|
|
|
|
|
|
|
if (vd->vdev_ops->vdev_op_leaf) {
|
|
|
|
space_map_t *sm = vd->vdev_dtl_sm;
|
|
|
|
|
|
|
|
if (sm != NULL &&
|
|
|
|
sm->sm_dbuf->db_size == sizeof (space_map_phys_t))
|
|
|
|
return (1);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < vd->vdev_children; c++)
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
refcount += get_dtl_refcount(vd->vdev_child[c]);
|
|
|
|
return (refcount);
|
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static int
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
get_metaslab_refcount(vdev_t *vd)
|
|
|
|
{
|
|
|
|
int refcount = 0;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (vd->vdev_top == vd) {
|
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
space_map_t *sm = vd->vdev_ms[m]->ms_sm;
|
|
|
|
|
|
|
|
if (sm != NULL &&
|
|
|
|
sm->sm_dbuf->db_size == sizeof (space_map_phys_t))
|
|
|
|
refcount++;
|
|
|
|
}
|
|
|
|
}
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < vd->vdev_children; c++)
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
refcount += get_metaslab_refcount(vd->vdev_child[c]);
|
|
|
|
|
|
|
|
return (refcount);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static int
|
|
|
|
get_obsolete_refcount(vdev_t *vd)
|
|
|
|
{
|
2018-10-10 01:42:42 +03:00
|
|
|
uint64_t obsolete_sm_object;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
int refcount = 0;
|
|
|
|
|
2018-10-10 01:42:42 +03:00
|
|
|
VERIFY0(vdev_obsolete_sm_object(vd, &obsolete_sm_object));
|
|
|
|
if (vd->vdev_top == vd && obsolete_sm_object != 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
dmu_object_info_t doi;
|
|
|
|
VERIFY0(dmu_object_info(vd->vdev_spa->spa_meta_objset,
|
2018-10-10 01:42:42 +03:00
|
|
|
obsolete_sm_object, &doi));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (doi.doi_bonus_size == sizeof (space_map_phys_t)) {
|
|
|
|
refcount++;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
ASSERT3P(vd->vdev_obsolete_sm, ==, NULL);
|
2018-10-10 01:42:42 +03:00
|
|
|
ASSERT3U(obsolete_sm_object, ==, 0);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
for (unsigned c = 0; c < vd->vdev_children; c++) {
|
|
|
|
refcount += get_obsolete_refcount(vd->vdev_child[c]);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (refcount);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
get_prev_obsolete_spacemap_refcount(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint64_t prev_obj =
|
|
|
|
spa->spa_condensing_indirect_phys.scip_prev_obsolete_sm_object;
|
|
|
|
if (prev_obj != 0) {
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
VERIFY0(dmu_object_info(spa->spa_meta_objset, prev_obj, &doi));
|
|
|
|
if (doi.doi_bonus_size == sizeof (space_map_phys_t)) {
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
static int
|
|
|
|
get_checkpoint_refcount(vdev_t *vd)
|
|
|
|
{
|
|
|
|
int refcount = 0;
|
|
|
|
|
|
|
|
if (vd->vdev_top == vd && vd->vdev_top_zap != 0 &&
|
|
|
|
zap_contains(spa_meta_objset(vd->vdev_spa),
|
|
|
|
vd->vdev_top_zap, VDEV_TOP_ZAP_POOL_CHECKPOINT_SM) == 0)
|
|
|
|
refcount++;
|
|
|
|
|
|
|
|
for (uint64_t c = 0; c < vd->vdev_children; c++)
|
|
|
|
refcount += get_checkpoint_refcount(vd->vdev_child[c]);
|
|
|
|
|
|
|
|
return (refcount);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static int
|
|
|
|
get_log_spacemap_refcount(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (avl_numnodes(&spa->spa_sm_logs_by_txg));
|
|
|
|
}
|
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
static int
|
|
|
|
verify_spacemap_refcounts(spa_t *spa)
|
|
|
|
{
|
2013-10-08 21:13:05 +04:00
|
|
|
uint64_t expected_refcount = 0;
|
|
|
|
uint64_t actual_refcount;
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
2013-10-08 21:13:05 +04:00
|
|
|
(void) feature_get_refcount(spa,
|
|
|
|
&spa_feature_table[SPA_FEATURE_SPACEMAP_HISTOGRAM],
|
|
|
|
&expected_refcount);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
actual_refcount = get_dtl_refcount(spa->spa_root_vdev);
|
|
|
|
actual_refcount += get_metaslab_refcount(spa->spa_root_vdev);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
actual_refcount += get_obsolete_refcount(spa->spa_root_vdev);
|
|
|
|
actual_refcount += get_prev_obsolete_spacemap_refcount(spa);
|
2016-12-17 01:11:29 +03:00
|
|
|
actual_refcount += get_checkpoint_refcount(spa->spa_root_vdev);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
actual_refcount += get_log_spacemap_refcount(spa);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
|
|
|
if (expected_refcount != actual_refcount) {
|
2013-10-08 21:13:05 +04:00
|
|
|
(void) printf("space map refcount mismatch: expected %lld != "
|
|
|
|
"actual %lld\n",
|
|
|
|
(longlong_t)expected_refcount,
|
|
|
|
(longlong_t)actual_refcount);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
return (2);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_spacemap(objset_t *os, space_map_t *sm)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *ddata[] = { "ALLOC", "FREE", "CONDENSE", "INVALID",
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
"INVALID", "INVALID", "INVALID", "INVALID" };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
if (sm == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
(void) printf("space map object %llu:\n",
|
2019-02-12 21:38:11 +03:00
|
|
|
(longlong_t)sm->sm_object);
|
|
|
|
(void) printf(" smp_length = 0x%llx\n",
|
|
|
|
(longlong_t)sm->sm_phys->smp_length);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
(void) printf(" smp_alloc = 0x%llx\n",
|
|
|
|
(longlong_t)sm->sm_phys->smp_alloc);
|
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
if (dump_opt['d'] < 6 && dump_opt['m'] < 4)
|
|
|
|
return;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Print out the freelist entries in both encoded and decoded form.
|
|
|
|
*/
|
2017-08-04 19:30:49 +03:00
|
|
|
uint8_t mapshift = sm->sm_shift;
|
|
|
|
int64_t alloc = 0;
|
2019-01-30 20:54:27 +03:00
|
|
|
uint64_t word, entry_id = 0;
|
2017-08-04 19:30:49 +03:00
|
|
|
for (uint64_t offset = 0; offset < space_map_length(sm);
|
|
|
|
offset += sizeof (word)) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
|
|
|
VERIFY0(dmu_read(os, space_map_object(sm), offset,
|
2017-08-04 19:30:49 +03:00
|
|
|
sizeof (word), &word, DMU_READ_PREFETCH));
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
if (sm_entry_is_debug(word)) {
|
2019-01-30 20:54:27 +03:00
|
|
|
(void) printf("\t [%6llu] %s: txg %llu pass %llu\n",
|
|
|
|
(u_longlong_t)entry_id,
|
2017-08-04 19:30:49 +03:00
|
|
|
ddata[SM_DEBUG_ACTION_DECODE(word)],
|
|
|
|
(u_longlong_t)SM_DEBUG_TXG_DECODE(word),
|
|
|
|
(u_longlong_t)SM_DEBUG_SYNCPASS_DECODE(word));
|
2019-01-30 20:54:27 +03:00
|
|
|
entry_id++;
|
2017-08-04 19:30:49 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint8_t words;
|
|
|
|
char entry_type;
|
|
|
|
uint64_t entry_off, entry_run, entry_vdev = SM_NO_VDEVID;
|
|
|
|
|
|
|
|
if (sm_entry_is_single_word(word)) {
|
|
|
|
entry_type = (SM_TYPE_DECODE(word) == SM_ALLOC) ?
|
|
|
|
'A' : 'F';
|
|
|
|
entry_off = (SM_OFFSET_DECODE(word) << mapshift) +
|
|
|
|
sm->sm_start;
|
|
|
|
entry_run = SM_RUN_DECODE(word) << mapshift;
|
|
|
|
words = 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2017-08-04 19:30:49 +03:00
|
|
|
/* it is a two-word entry so we read another word */
|
|
|
|
ASSERT(sm_entry_is_double_word(word));
|
|
|
|
|
|
|
|
uint64_t extra_word;
|
|
|
|
offset += sizeof (extra_word);
|
|
|
|
VERIFY0(dmu_read(os, space_map_object(sm), offset,
|
|
|
|
sizeof (extra_word), &extra_word,
|
|
|
|
DMU_READ_PREFETCH));
|
|
|
|
|
|
|
|
ASSERT3U(offset, <=, space_map_length(sm));
|
|
|
|
|
|
|
|
entry_run = SM2_RUN_DECODE(word) << mapshift;
|
|
|
|
entry_vdev = SM2_VDEV_DECODE(word);
|
|
|
|
entry_type = (SM2_TYPE_DECODE(extra_word) == SM_ALLOC) ?
|
|
|
|
'A' : 'F';
|
|
|
|
entry_off = (SM2_OFFSET_DECODE(extra_word) <<
|
|
|
|
mapshift) + sm->sm_start;
|
|
|
|
words = 2;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2017-08-04 19:30:49 +03:00
|
|
|
|
|
|
|
(void) printf("\t [%6llu] %c range:"
|
|
|
|
" %010llx-%010llx size: %06llx vdev: %06llu words: %u\n",
|
2019-01-30 20:54:27 +03:00
|
|
|
(u_longlong_t)entry_id,
|
2017-08-04 19:30:49 +03:00
|
|
|
entry_type, (u_longlong_t)entry_off,
|
|
|
|
(u_longlong_t)(entry_off + entry_run),
|
|
|
|
(u_longlong_t)entry_run,
|
|
|
|
(u_longlong_t)entry_vdev, words);
|
|
|
|
|
|
|
|
if (entry_type == 'A')
|
|
|
|
alloc += entry_run;
|
|
|
|
else
|
|
|
|
alloc -= entry_run;
|
2019-01-30 20:54:27 +03:00
|
|
|
entry_id++;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (alloc != space_map_allocated(sm)) {
|
2017-08-04 19:30:49 +03:00
|
|
|
(void) printf("space_map_object alloc (%lld) INCONSISTENT "
|
|
|
|
"with space map summary (%lld)\n",
|
|
|
|
(longlong_t)space_map_allocated(sm), (longlong_t)alloc);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
static void
|
|
|
|
dump_metaslab_stats(metaslab_t *msp)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char maxbuf[32];
|
2016-12-17 01:11:29 +03:00
|
|
|
range_tree_t *rt = msp->ms_allocatable;
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
zfs_btree_t *t = &msp->ms_allocatable_by_size;
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
int free_pct = range_tree_space(rt) * 100 / msp->ms_size;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* max sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (maxbuf) >= NN_NUMBUF_SZ);
|
|
|
|
|
Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-06 00:34:27 +03:00
|
|
|
zdb_nicenum(metaslab_largest_allocatable(msp), maxbuf, sizeof (maxbuf));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t %25s %10lu %7s %6s %4s %4d%%\n",
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
"segments", zfs_btree_numnodes(t), "maxsize", maxbuf,
|
2009-07-03 02:44:48 +04:00
|
|
|
"freepct", free_pct);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
(void) printf("\tIn-memory histogram:\n");
|
|
|
|
dump_histogram(rt->rt_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_metaslab(metaslab_t *msp)
|
|
|
|
{
|
|
|
|
vdev_t *vd = msp->ms_group->mg_vd;
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
space_map_t *sm = msp->ms_sm;
|
2010-05-29 00:45:14 +04:00
|
|
|
char freebuf[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(msp->ms_size - space_map_allocated(sm), freebuf,
|
|
|
|
sizeof (freebuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf(
|
2010-05-29 00:45:14 +04:00
|
|
|
"\tmetaslab %6llu offset %12llx spacemap %6llu free %5s\n",
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
(u_longlong_t)msp->ms_id, (u_longlong_t)msp->ms_start,
|
|
|
|
(u_longlong_t)space_map_object(sm), freebuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
if (dump_opt['m'] > 2 && !dump_opt['L']) {
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_enter(&msp->ms_lock);
|
2019-01-18 22:10:32 +03:00
|
|
|
VERIFY0(metaslab_load(msp));
|
|
|
|
range_tree_stat_verify(msp->ms_allocatable);
|
2009-07-03 02:44:48 +04:00
|
|
|
dump_metaslab_stats(msp);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
metaslab_unload(msp);
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_exit(&msp->ms_lock);
|
|
|
|
}
|
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
if (dump_opt['m'] > 1 && sm != NULL &&
|
2013-10-08 21:13:05 +04:00
|
|
|
spa_feature_is_active(spa, SPA_FEATURE_SPACEMAP_HISTOGRAM)) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
/*
|
|
|
|
* The space map histogram represents free space in chunks
|
|
|
|
* of sm_shift (i.e. bucket 0 refers to 2^sm_shift).
|
|
|
|
*/
|
2014-07-20 00:19:24 +04:00
|
|
|
(void) printf("\tOn-disk histogram:\t\tfragmentation %llu\n",
|
|
|
|
(u_longlong_t)msp->ms_fragmentation);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(sm->sm_phys->smp_histogram,
|
2014-07-20 00:19:24 +04:00
|
|
|
SPACE_MAP_HISTOGRAM_SIZE, sm->sm_shift);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
}
|
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
ASSERT(msp->ms_size == (1ULL << vd->vdev_ms_shift));
|
|
|
|
dump_spacemap(spa->spa_meta_objset, msp->ms_sm);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
|
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP)) {
|
|
|
|
(void) printf("\tFlush data:\n\tunflushed txg=%llu\n\n",
|
|
|
|
(u_longlong_t)metaslab_unflushed_txg(msp));
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
print_vdev_metaslab_header(vdev_t *vd)
|
|
|
|
{
|
2018-09-06 04:33:36 +03:00
|
|
|
vdev_alloc_bias_t alloc_bias = vd->vdev_alloc_bias;
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
const char *bias_str = "";
|
|
|
|
if (alloc_bias == VDEV_BIAS_LOG || vd->vdev_islog) {
|
|
|
|
bias_str = VDEV_ALLOC_BIAS_LOG;
|
|
|
|
} else if (alloc_bias == VDEV_BIAS_SPECIAL) {
|
|
|
|
bias_str = VDEV_ALLOC_BIAS_SPECIAL;
|
|
|
|
} else if (alloc_bias == VDEV_BIAS_DEDUP) {
|
|
|
|
bias_str = VDEV_ALLOC_BIAS_DEDUP;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t ms_flush_data_obj = 0;
|
|
|
|
if (vd->vdev_top_zap != 0) {
|
|
|
|
int error = zap_lookup(spa_meta_objset(vd->vdev_spa),
|
|
|
|
vd->vdev_top_zap, VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS,
|
|
|
|
sizeof (uint64_t), 1, &ms_flush_data_obj);
|
|
|
|
if (error != ENOENT) {
|
|
|
|
ASSERT0(error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\tvdev %10llu %s",
|
|
|
|
(u_longlong_t)vd->vdev_id, bias_str);
|
2018-09-06 04:33:36 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (ms_flush_data_obj != 0) {
|
|
|
|
(void) printf(" ms_unflushed_phys object %llu",
|
|
|
|
(u_longlong_t)ms_flush_data_obj);
|
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
(void) printf("\n\t%-10s%5llu %-19s %-15s %-12s\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
"metaslabs", (u_longlong_t)vd->vdev_ms_count,
|
|
|
|
"offset", "spacemap", "free");
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%15s %19s %15s %12s\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
"---------------", "-------------------",
|
2018-09-06 04:33:36 +03:00
|
|
|
"---------------", "------------");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-07-20 00:19:24 +04:00
|
|
|
static void
|
|
|
|
dump_metaslab_groups(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
metaslab_class_t *mc = spa_normal_class(spa);
|
|
|
|
uint64_t fragmentation;
|
|
|
|
|
|
|
|
metaslab_class_histogram_verify(mc);
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < rvd->vdev_children; c++) {
|
2014-07-20 00:19:24 +04:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
metaslab_group_t *mg = tvd->vdev_mg;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
if (mg == NULL || mg->mg_class != mc)
|
2014-07-20 00:19:24 +04:00
|
|
|
continue;
|
|
|
|
|
|
|
|
metaslab_group_histogram_verify(mg);
|
|
|
|
mg->mg_fragmentation = metaslab_group_fragmentation(mg);
|
|
|
|
|
|
|
|
(void) printf("\tvdev %10llu\t\tmetaslabs%5llu\t\t"
|
|
|
|
"fragmentation",
|
|
|
|
(u_longlong_t)tvd->vdev_id,
|
|
|
|
(u_longlong_t)tvd->vdev_ms_count);
|
|
|
|
if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
|
|
|
|
(void) printf("%3s\n", "-");
|
|
|
|
} else {
|
|
|
|
(void) printf("%3llu%%\n",
|
|
|
|
(u_longlong_t)mg->mg_fragmentation);
|
|
|
|
}
|
|
|
|
dump_histogram(mg->mg_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\tpool %s\tfragmentation", spa_name(spa));
|
|
|
|
fragmentation = metaslab_class_fragmentation(mc);
|
|
|
|
if (fragmentation == ZFS_FRAG_INVALID)
|
|
|
|
(void) printf("\t%3s\n", "-");
|
|
|
|
else
|
|
|
|
(void) printf("\t%3llu%%\n", (u_longlong_t)fragmentation);
|
|
|
|
dump_histogram(mc->mc_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static void
|
|
|
|
print_vdev_indirect(vdev_t *vd)
|
|
|
|
{
|
|
|
|
vdev_indirect_config_t *vic = &vd->vdev_indirect_config;
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
vdev_indirect_births_t *vib = vd->vdev_indirect_births;
|
|
|
|
|
|
|
|
if (vim == NULL) {
|
|
|
|
ASSERT3P(vib, ==, NULL);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3U(vdev_indirect_mapping_object(vim), ==,
|
|
|
|
vic->vic_mapping_object);
|
|
|
|
ASSERT3U(vdev_indirect_births_object(vib), ==,
|
|
|
|
vic->vic_births_object);
|
|
|
|
|
|
|
|
(void) printf("indirect births obj %llu:\n",
|
|
|
|
(longlong_t)vic->vic_births_object);
|
|
|
|
(void) printf(" vib_count = %llu\n",
|
|
|
|
(longlong_t)vdev_indirect_births_count(vib));
|
|
|
|
for (uint64_t i = 0; i < vdev_indirect_births_count(vib); i++) {
|
|
|
|
vdev_indirect_birth_entry_phys_t *cur_vibe =
|
|
|
|
&vib->vib_entries[i];
|
|
|
|
(void) printf("\toffset %llx -> txg %llu\n",
|
|
|
|
(longlong_t)cur_vibe->vibe_offset,
|
|
|
|
(longlong_t)cur_vibe->vibe_phys_birth_txg);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
(void) printf("indirect mapping obj %llu:\n",
|
|
|
|
(longlong_t)vic->vic_mapping_object);
|
|
|
|
(void) printf(" vim_max_offset = 0x%llx\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_max_offset(vim));
|
|
|
|
(void) printf(" vim_bytes_mapped = 0x%llx\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_bytes_mapped(vim));
|
|
|
|
(void) printf(" vim_count = %llu\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_num_entries(vim));
|
|
|
|
|
|
|
|
if (dump_opt['d'] <= 5 && dump_opt['m'] <= 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
uint32_t *counts = vdev_indirect_mapping_load_obsolete_counts(vim);
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < vdev_indirect_mapping_num_entries(vim); i++) {
|
|
|
|
vdev_indirect_mapping_entry_phys_t *vimep =
|
|
|
|
&vim->vim_entries[i];
|
|
|
|
(void) printf("\t<%llx:%llx:%llx> -> "
|
|
|
|
"<%llx:%llx:%llx> (%x obsolete)\n",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)DVA_MAPPING_GET_SRC_OFFSET(vimep),
|
|
|
|
(longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_VDEV(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_OFFSET(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
|
|
|
|
counts[i]);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2018-10-10 01:42:42 +03:00
|
|
|
uint64_t obsolete_sm_object;
|
|
|
|
VERIFY0(vdev_obsolete_sm_object(vd, &obsolete_sm_object));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (obsolete_sm_object != 0) {
|
|
|
|
objset_t *mos = vd->vdev_spa->spa_meta_objset;
|
|
|
|
(void) printf("obsolete space map object %llu:\n",
|
|
|
|
(u_longlong_t)obsolete_sm_object);
|
|
|
|
ASSERT(vd->vdev_obsolete_sm != NULL);
|
|
|
|
ASSERT3U(space_map_object(vd->vdev_obsolete_sm), ==,
|
|
|
|
obsolete_sm_object);
|
|
|
|
dump_spacemap(mos, vd->vdev_obsolete_sm);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_metaslabs(spa_t *spa)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *vd, *rvd = spa->spa_root_vdev;
|
|
|
|
uint64_t m, c = 0, children = rvd->vdev_children;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\nMetaslabs:\n");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['d'] && zopt_objects > 0) {
|
|
|
|
c = zopt_object[0];
|
|
|
|
|
|
|
|
if (c >= children)
|
|
|
|
(void) fatal("bad vdev id: %llu", (u_longlong_t)c);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zopt_objects > 1) {
|
|
|
|
vd = rvd->vdev_child[c];
|
|
|
|
print_vdev_metaslab_header(vd);
|
|
|
|
|
|
|
|
for (m = 1; m < zopt_objects; m++) {
|
|
|
|
if (zopt_object[m] < vd->vdev_ms_count)
|
|
|
|
dump_metaslab(
|
|
|
|
vd->vdev_ms[zopt_object[m]]);
|
|
|
|
else
|
|
|
|
(void) fprintf(stderr, "bad metaslab "
|
|
|
|
"number %llu\n",
|
|
|
|
(u_longlong_t)zopt_object[m]);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
children = c + 1;
|
|
|
|
}
|
|
|
|
for (; c < children; c++) {
|
|
|
|
vd = rvd->vdev_child[c];
|
|
|
|
print_vdev_metaslab_header(vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
print_vdev_indirect(vd);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (m = 0; m < vd->vdev_ms_count; m++)
|
|
|
|
dump_metaslab(vd->vdev_ms[m]);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static void
|
|
|
|
dump_log_spacemaps(spa_t *spa)
|
|
|
|
{
|
2019-07-18 22:54:03 +03:00
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
|
|
|
|
return;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
(void) printf("\nLog Space Maps in Pool:\n");
|
|
|
|
for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
|
|
|
|
sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
|
|
|
|
space_map_t *sm = NULL;
|
|
|
|
VERIFY0(space_map_open(&sm, spa_meta_objset(spa),
|
|
|
|
sls->sls_sm_obj, 0, UINT64_MAX, SPA_MINBLOCKSHIFT));
|
|
|
|
|
|
|
|
(void) printf("Log Spacemap object %llu txg %llu\n",
|
|
|
|
(u_longlong_t)sls->sls_sm_obj, (u_longlong_t)sls->sls_txg);
|
|
|
|
dump_spacemap(spa->spa_meta_objset, sm);
|
|
|
|
space_map_close(sm);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dump_dde(const ddt_t *ddt, const ddt_entry_t *dde, uint64_t index)
|
|
|
|
{
|
|
|
|
const ddt_phys_t *ddp = dde->dde_phys;
|
|
|
|
const ddt_key_t *ddk = &dde->dde_key;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *types[4] = { "ditto", "single", "double", "triple" };
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
blkptr_t blk;
|
2010-08-26 20:52:39 +04:00
|
|
|
int p;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ddp->ddp_phys_birth == 0)
|
|
|
|
continue;
|
|
|
|
ddt_bp_create(ddt->ddt_checksum, ddk, ddp, &blk);
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &blk);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("index %llx refcnt %llu %s %s\n",
|
|
|
|
(u_longlong_t)index, (u_longlong_t)ddp->ddp_refcnt,
|
|
|
|
types[p], blkbuf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_dedup_ratio(const ddt_stat_t *dds)
|
|
|
|
{
|
|
|
|
double rL, rP, rD, D, dedup, compress, copies;
|
|
|
|
|
|
|
|
if (dds->dds_blocks == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
rL = (double)dds->dds_ref_lsize;
|
|
|
|
rP = (double)dds->dds_ref_psize;
|
|
|
|
rD = (double)dds->dds_ref_dsize;
|
|
|
|
D = (double)dds->dds_dsize;
|
|
|
|
|
|
|
|
dedup = rD / D;
|
|
|
|
compress = rL / rP;
|
|
|
|
copies = rD / rP;
|
|
|
|
|
|
|
|
(void) printf("dedup = %.2f, compress = %.2f, copies = %.2f, "
|
|
|
|
"dedup * compress / copies = %.2f\n\n",
|
|
|
|
dedup, compress, copies, dedup * compress / copies);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_ddt(ddt_t *ddt, enum ddt_type type, enum ddt_class class)
|
|
|
|
{
|
|
|
|
char name[DDT_NAMELEN];
|
|
|
|
ddt_entry_t dde;
|
|
|
|
uint64_t walk = 0;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
uint64_t count, dspace, mspace;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = ddt_object_info(ddt, type, class, &doi);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
|
|
|
return;
|
|
|
|
ASSERT(error == 0);
|
|
|
|
|
2012-10-26 21:01:49 +04:00
|
|
|
error = ddt_object_count(ddt, type, class, &count);
|
|
|
|
ASSERT(error == 0);
|
|
|
|
if (count == 0)
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dspace = doi.doi_physical_blocks_512 << 9;
|
|
|
|
mspace = doi.doi_fill_count * doi.doi_data_block_size;
|
|
|
|
|
|
|
|
ddt_object_name(ddt, type, class, name);
|
|
|
|
|
|
|
|
(void) printf("%s: %llu entries, size %llu on disk, %llu in core\n",
|
|
|
|
name,
|
|
|
|
(u_longlong_t)count,
|
|
|
|
(u_longlong_t)(dspace / count),
|
|
|
|
(u_longlong_t)(mspace / count));
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
zpool_dump_ddt(NULL, &ddt->ddt_histogram[type][class]);
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 4)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 5 && class == DDT_CLASS_UNIQUE)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("%s contents:\n\n", name);
|
|
|
|
|
|
|
|
while ((error = ddt_object_walk(ddt, type, class, &walk, &dde)) == 0)
|
|
|
|
dump_dde(ddt, &dde, walk);
|
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
ASSERT3U(error, ==, ENOENT);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_all_ddts(spa_t *spa)
|
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
ddt_histogram_t ddh_total;
|
|
|
|
ddt_stat_t dds_total;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&ddh_total, sizeof (ddh_total));
|
|
|
|
bzero(&dds_total, sizeof (dds_total));
|
2010-08-26 20:52:39 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (enum zio_checksum c = 0; c < ZIO_CHECKSUM_FUNCTIONS; c++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_t *ddt = spa->spa_ddt[c];
|
2017-10-27 22:46:35 +03:00
|
|
|
for (enum ddt_type type = 0; type < DDT_TYPES; type++) {
|
|
|
|
for (enum ddt_class class = 0; class < DDT_CLASSES;
|
2010-05-29 00:45:14 +04:00
|
|
|
class++) {
|
|
|
|
dump_ddt(ddt, type, class);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ddt_get_dedup_stats(spa, &dds_total);
|
|
|
|
|
|
|
|
if (dds_total.dds_blocks == 0) {
|
|
|
|
(void) printf("All DDTs are empty\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
if (dump_opt['D'] > 1) {
|
|
|
|
(void) printf("DDT histogram (aggregated over all DDTs):\n");
|
|
|
|
ddt_get_dedup_histogram(spa, &ddh_total);
|
|
|
|
zpool_dump_ddt(&dds_total, &ddh_total);
|
|
|
|
}
|
|
|
|
|
|
|
|
dump_dedup_ratio(&dds_total);
|
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
dump_dtl_seg(void *arg, uint64_t start, uint64_t size)
|
2009-01-16 00:59:39 +03:00
|
|
|
{
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
char *prefix = arg;
|
2009-01-16 00:59:39 +03:00
|
|
|
|
|
|
|
(void) printf("%s [%llu,%llu) length %llu\n",
|
|
|
|
prefix,
|
|
|
|
(u_longlong_t)start,
|
|
|
|
(u_longlong_t)(start + size),
|
|
|
|
(u_longlong_t)(size));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_dtl(vdev_t *vd, int indent)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
boolean_t required;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *name[DTL_TYPES] = { "missing", "partial", "scrub",
|
|
|
|
"outage" };
|
2009-01-16 00:59:39 +03:00
|
|
|
char prefix[256];
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
2009-01-16 00:59:39 +03:00
|
|
|
required = vdev_dtl_required(vd);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (indent == 0)
|
|
|
|
(void) printf("\nDirty time logs:\n\n");
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) printf("\t%*s%s [%s]\n", indent, "",
|
2008-12-03 23:09:06 +03:00
|
|
|
vd->vdev_path ? vd->vdev_path :
|
2009-01-16 00:59:39 +03:00
|
|
|
vd->vdev_parent ? vd->vdev_ops->vdev_op_type : spa_name(spa),
|
|
|
|
required ? "DTL-required" : "DTL-expendable");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int t = 0; t < DTL_TYPES; t++) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
range_tree_t *rt = vd->vdev_dtl[t];
|
|
|
|
if (range_tree_space(rt) == 0)
|
2009-01-16 00:59:39 +03:00
|
|
|
continue;
|
|
|
|
(void) snprintf(prefix, sizeof (prefix), "\t%*s%s",
|
|
|
|
indent + 2, "", name[t]);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
range_tree_walk(rt, dump_dtl_seg, prefix);
|
2009-01-16 00:59:39 +03:00
|
|
|
if (dump_opt['d'] > 5 && vd->vdev_children == 0)
|
|
|
|
dump_spacemap(spa->spa_meta_objset,
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
vd->vdev_dtl_sm);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < vd->vdev_children; c++)
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_dtl(vd->vdev_child[c], indent + 4);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dump_history(spa_t *spa)
|
|
|
|
{
|
|
|
|
nvlist_t **events = NULL;
|
2015-06-24 21:17:36 +03:00
|
|
|
char *buf;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t resid, len, off = 0;
|
|
|
|
uint_t num = 0;
|
|
|
|
int error;
|
|
|
|
time_t tsec;
|
|
|
|
struct tm t;
|
|
|
|
char tbuf[30];
|
|
|
|
char internalstr[MAXPATHLEN];
|
|
|
|
|
2015-06-24 21:17:36 +03:00
|
|
|
if ((buf = malloc(SPA_OLD_MAXBLOCKSIZE)) == NULL) {
|
|
|
|
(void) fprintf(stderr, "%s: unable to allocate I/O buffer\n",
|
|
|
|
__func__);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
do {
|
2015-06-24 21:17:36 +03:00
|
|
|
len = SPA_OLD_MAXBLOCKSIZE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if ((error = spa_history_get(spa, &off, &len, buf)) != 0) {
|
|
|
|
(void) fprintf(stderr, "Unable to read history: "
|
|
|
|
"error %d\n", error);
|
2015-06-24 21:17:36 +03:00
|
|
|
free(buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zpool_history_unpack(buf, len, &resid, &events, &num) != 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
off -= resid;
|
|
|
|
} while (len != 0);
|
|
|
|
|
|
|
|
(void) printf("\nHistory:\n");
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned i = 0; i < num; i++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t time, txg, ievent;
|
|
|
|
char *cmd, *intstr;
|
2013-08-28 15:45:09 +04:00
|
|
|
boolean_t printed = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (nvlist_lookup_uint64(events[i], ZPOOL_HIST_TIME,
|
|
|
|
&time) != 0)
|
2013-08-28 15:45:09 +04:00
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_lookup_string(events[i], ZPOOL_HIST_CMD,
|
|
|
|
&cmd) != 0) {
|
|
|
|
if (nvlist_lookup_uint64(events[i],
|
|
|
|
ZPOOL_HIST_INT_EVENT, &ievent) != 0)
|
2013-08-28 15:45:09 +04:00
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
verify(nvlist_lookup_uint64(events[i],
|
|
|
|
ZPOOL_HIST_TXG, &txg) == 0);
|
|
|
|
verify(nvlist_lookup_string(events[i],
|
|
|
|
ZPOOL_HIST_INT_STR, &intstr) == 0);
|
2013-08-28 15:45:09 +04:00
|
|
|
if (ievent >= ZFS_NUM_LEGACY_HISTORY_EVENTS)
|
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) snprintf(internalstr,
|
|
|
|
sizeof (internalstr),
|
|
|
|
"[internal %s txg:%lld] %s",
|
2010-08-26 20:52:39 +04:00
|
|
|
zfs_history_event_names[ievent],
|
|
|
|
(longlong_t)txg, intstr);
|
2010-05-29 00:45:14 +04:00
|
|
|
cmd = internalstr;
|
|
|
|
}
|
|
|
|
tsec = time;
|
|
|
|
(void) localtime_r(&tsec, &t);
|
|
|
|
(void) strftime(tbuf, sizeof (tbuf), "%F.%T", &t);
|
|
|
|
(void) printf("%s %s\n", tbuf, cmd);
|
2013-08-28 15:45:09 +04:00
|
|
|
printed = B_TRUE;
|
|
|
|
|
|
|
|
next:
|
|
|
|
if (dump_opt['h'] > 1) {
|
|
|
|
if (!printed)
|
|
|
|
(void) printf("unrecognized record:\n");
|
|
|
|
dump_nvlist(events[i], 2);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2015-06-24 21:17:36 +03:00
|
|
|
free(buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dnode(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
2014-06-25 22:37:59 +04:00
|
|
|
blkid2offset(const dnode_phys_t *dnp, const blkptr_t *bp,
|
|
|
|
const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dnp == NULL) {
|
|
|
|
ASSERT(zb->zb_level < 0);
|
|
|
|
if (zb->zb_object == 0)
|
|
|
|
return (zb->zb_blkid);
|
|
|
|
return (zb->zb_blkid * BP_GET_LSIZE(bp));
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(zb->zb_level >= 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return ((zb->zb_blkid <<
|
|
|
|
(zb->zb_level * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT))) *
|
2008-11-20 23:01:55 +03:00
|
|
|
dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
snprintf_blkptr_compact(char *blkbuf, size_t buflen, const blkptr_t *bp,
|
|
|
|
boolean_t bp_freed)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
const dva_t *dva = bp->blk_dva;
|
|
|
|
int ndvas = dump_opt['d'] > 5 ? BP_GET_NDVAS(bp) : 1;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (dump_opt['b'] >= 6) {
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, buflen, bp);
|
2019-07-26 20:54:14 +03:00
|
|
|
if (bp_freed) {
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf), " %s", "FREE");
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_EMBEDDED(bp)) {
|
|
|
|
(void) sprintf(blkbuf,
|
|
|
|
"EMBEDDED et=%u %llxL/%llxP B=%llu",
|
|
|
|
(int)BPE_GET_ETYPE(bp),
|
|
|
|
(u_longlong_t)BPE_GET_LSIZE(bp),
|
|
|
|
(u_longlong_t)BPE_GET_PSIZE(bp),
|
|
|
|
(u_longlong_t)bp->blk_birth);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
blkbuf[0] = '\0';
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < ndvas; i++)
|
2013-12-09 22:37:51 +04:00
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf), "%llu:%llx:%llx ",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)DVA_GET_VDEV(&dva[i]),
|
|
|
|
(u_longlong_t)DVA_GET_OFFSET(&dva[i]),
|
|
|
|
(u_longlong_t)DVA_GET_ASIZE(&dva[i]));
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_IS_HOLE(bp)) {
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
2015-03-27 05:03:22 +03:00
|
|
|
buflen - strlen(blkbuf),
|
|
|
|
"%llxL B=%llu",
|
|
|
|
(u_longlong_t)BP_GET_LSIZE(bp),
|
2013-12-09 22:37:51 +04:00
|
|
|
(u_longlong_t)bp->blk_birth);
|
|
|
|
} else {
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf),
|
|
|
|
"%llxL/%llxP F=%llu B=%llu/%llu",
|
|
|
|
(u_longlong_t)BP_GET_LSIZE(bp),
|
|
|
|
(u_longlong_t)BP_GET_PSIZE(bp),
|
2014-06-06 01:19:08 +04:00
|
|
|
(u_longlong_t)BP_GET_FILL(bp),
|
2013-12-09 22:37:51 +04:00
|
|
|
(u_longlong_t)bp->blk_birth,
|
|
|
|
(u_longlong_t)BP_PHYSICAL_BIRTH(bp));
|
2019-07-26 20:54:14 +03:00
|
|
|
if (bp_freed)
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf), " %s", "FREE");
|
2013-12-09 22:37:51 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
static void
|
2014-06-25 22:37:59 +04:00
|
|
|
print_indirect(blkptr_t *bp, const zbookmark_phys_t *zb,
|
2008-12-03 23:09:06 +03:00
|
|
|
const dnode_phys_t *dnp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
int l;
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
ASSERT3U(BP_GET_TYPE(bp), ==, dnp->dn_type);
|
|
|
|
ASSERT3U(BP_GET_LEVEL(bp), ==, zb->zb_level);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%16llx ", (u_longlong_t)blkid2offset(dnp, bp, zb));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(zb->zb_level >= 0);
|
|
|
|
|
|
|
|
for (l = dnp->dn_nlevels - 1; l >= -1; l--) {
|
|
|
|
if (l == zb->zb_level) {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("L%llx", (u_longlong_t)zb->zb_level);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf(" ");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), bp, B_FALSE);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("%s\n", blkbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
visit_indirect(spa_t *spa, const dnode_phys_t *dnp,
|
2014-06-25 22:37:59 +04:00
|
|
|
blkptr_t *bp, const zbookmark_phys_t *zb)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
int err = 0;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
if (bp->blk_birth == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
print_indirect(bp, zb, dnp);
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_GET_LEVEL(bp) > 0 && !BP_IS_HOLE(bp)) {
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_flags_t flags = ARC_FLAG_WAIT;
|
2008-12-03 23:09:06 +03:00
|
|
|
int i;
|
|
|
|
blkptr_t *cbp;
|
|
|
|
int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
|
|
|
|
arc_buf_t *buf;
|
|
|
|
uint64_t fill = 0;
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(!BP_IS_REDACTED(bp));
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-07-03 00:26:24 +04:00
|
|
|
err = arc_read(NULL, spa, bp, arc_getbuf_func, &buf,
|
2008-12-03 23:09:06 +03:00
|
|
|
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
|
|
|
|
if (err)
|
|
|
|
return (err);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(buf->b_data);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* recursively visit blocks below this */
|
|
|
|
cbp = buf->b_data;
|
|
|
|
for (i = 0; i < epb; i++, cbp++) {
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t czb;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
|
|
|
|
zb->zb_level - 1,
|
|
|
|
zb->zb_blkid * epb + i);
|
|
|
|
err = visit_indirect(spa, dnp, cbp, &czb);
|
|
|
|
if (err)
|
|
|
|
break;
|
2014-06-06 01:19:08 +04:00
|
|
|
fill += BP_GET_FILL(cbp);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!err)
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(fill, ==, BP_GET_FILL(bp));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(buf, &buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_indirect(dnode_t *dn)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
dnode_phys_t *dnp = dn->dn_phys;
|
|
|
|
int j;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t czb;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("Indirect blocks:\n");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
SET_BOOKMARK(&czb, dmu_objset_id(dn->dn_objset),
|
2008-12-03 23:09:06 +03:00
|
|
|
dn->dn_object, dnp->dn_nlevels - 1, 0);
|
|
|
|
for (j = 0; j < dnp->dn_nblkptr; j++) {
|
|
|
|
czb.zb_blkid = j;
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) visit_indirect(dmu_objset_spa(dn->dn_objset), dnp,
|
2008-12-03 23:09:06 +03:00
|
|
|
&dnp->dn_blkptr[j], &czb);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dsl_dir(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dsl_dir_phys_t *dd = data;
|
|
|
|
time_t crtime;
|
2010-05-29 00:45:14 +04:00
|
|
|
char nice[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (nice) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dd == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT3U(size, >=, sizeof (dsl_dir_phys_t));
|
|
|
|
|
|
|
|
crtime = dd->dd_creation_time;
|
|
|
|
(void) printf("\t\tcreation_time = %s", ctime(&crtime));
|
|
|
|
(void) printf("\t\thead_dataset_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_head_dataset_obj);
|
|
|
|
(void) printf("\t\tparent_dir_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_parent_obj);
|
|
|
|
(void) printf("\t\torigin_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_origin_obj);
|
|
|
|
(void) printf("\t\tchild_dir_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_child_dir_zapobj);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_used_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tused_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_compressed_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tcompressed_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_uncompressed_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tuncompressed_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_quota, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tquota = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_reserved, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\treserved = %s\n", nice);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tprops_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_props_zapobj);
|
|
|
|
(void) printf("\t\tdeleg_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_deleg_zapobj);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tflags = %llx\n",
|
|
|
|
(u_longlong_t)dd->dd_flags);
|
|
|
|
|
|
|
|
#define DO(which) \
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_used_breakdown[DD_USED_ ## which], nice, \
|
|
|
|
sizeof (nice)); \
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tused_breakdown[" #which "] = %s\n", nice)
|
|
|
|
DO(HEAD);
|
|
|
|
DO(SNAP);
|
|
|
|
DO(CHILD);
|
|
|
|
DO(CHILD_RSRV);
|
|
|
|
DO(REFRSRV);
|
|
|
|
#undef DO
|
2018-10-12 21:28:26 +03:00
|
|
|
(void) printf("\t\tclones = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_clones);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dsl_dataset(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dsl_dataset_phys_t *ds = data;
|
|
|
|
time_t crtime;
|
2010-05-29 00:45:14 +04:00
|
|
|
char used[32], compressed[32], uncompressed[32], unique[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (used) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (compressed) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncompressed) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (unique) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ds == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(size == sizeof (*ds));
|
|
|
|
crtime = ds->ds_creation_time;
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(ds->ds_referenced_bytes, used, sizeof (used));
|
|
|
|
zdb_nicenum(ds->ds_compressed_bytes, compressed, sizeof (compressed));
|
|
|
|
zdb_nicenum(ds->ds_uncompressed_bytes, uncompressed,
|
|
|
|
sizeof (uncompressed));
|
|
|
|
zdb_nicenum(ds->ds_unique_bytes, unique, sizeof (unique));
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &ds->ds_bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tdir_obj = %llu\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)ds->ds_dir_obj);
|
|
|
|
(void) printf("\t\tprev_snap_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_prev_snap_obj);
|
|
|
|
(void) printf("\t\tprev_snap_txg = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_prev_snap_txg);
|
|
|
|
(void) printf("\t\tnext_snap_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_next_snap_obj);
|
|
|
|
(void) printf("\t\tsnapnames_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_snapnames_zapobj);
|
|
|
|
(void) printf("\t\tnum_children = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_num_children);
|
2009-08-18 22:43:27 +04:00
|
|
|
(void) printf("\t\tuserrefs_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_userrefs_obj);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tcreation_time = %s", ctime(&crtime));
|
|
|
|
(void) printf("\t\tcreation_txg = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_creation_txg);
|
|
|
|
(void) printf("\t\tdeadlist_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_deadlist_obj);
|
|
|
|
(void) printf("\t\tused_bytes = %s\n", used);
|
|
|
|
(void) printf("\t\tcompressed_bytes = %s\n", compressed);
|
|
|
|
(void) printf("\t\tuncompressed_bytes = %s\n", uncompressed);
|
|
|
|
(void) printf("\t\tunique = %s\n", unique);
|
|
|
|
(void) printf("\t\tfsid_guid = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_fsid_guid);
|
|
|
|
(void) printf("\t\tguid = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_guid);
|
|
|
|
(void) printf("\t\tflags = %llx\n",
|
|
|
|
(u_longlong_t)ds->ds_flags);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tnext_clones_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_next_clones_obj);
|
|
|
|
(void) printf("\t\tprops_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_props_obj);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tbp = %s\n", blkbuf);
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
dump_bptree_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
if (bp->blk_birth != 0) {
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2012-12-14 03:24:15 +04:00
|
|
|
(void) printf("\t%s\n", blkbuf);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2017-10-27 22:46:35 +03:00
|
|
|
dump_bptree(objset_t *os, uint64_t obj, const char *name)
|
2012-12-14 03:24:15 +04:00
|
|
|
{
|
|
|
|
char bytes[32];
|
|
|
|
bptree_phys_t *bt;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
VERIFY3U(0, ==, dmu_bonus_hold(os, obj, FTAG, &db));
|
|
|
|
bt = db->db_data;
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bt->bt_bytes, bytes, sizeof (bytes));
|
2012-12-14 03:24:15 +04:00
|
|
|
(void) printf("\n %s: %llu datasets, %s\n",
|
|
|
|
name, (unsigned long long)(bt->bt_end - bt->bt_begin), bytes);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
if (dump_opt['d'] < 5)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
(void) bptree_iterate(os, obj, B_FALSE, dump_bptree_cb, NULL, NULL);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_bpobj_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed, dmu_tx_t *tx)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
ASSERT(bp->blk_birth != 0);
|
2019-07-26 20:54:14 +03:00
|
|
|
snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), bp, bp_freed);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t%s\n", blkbuf);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2017-10-27 22:46:35 +03:00
|
|
|
dump_full_bpobj(bpobj_t *bpo, const char *name, int indent)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char bytes[32];
|
|
|
|
char comp[32];
|
|
|
|
char uncomp[32];
|
2013-07-05 23:37:16 +04:00
|
|
|
uint64_t i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_bytes, bytes, sizeof (bytes));
|
2013-07-05 23:37:16 +04:00
|
|
|
if (bpo->bpo_havesubobj && bpo->bpo_phys->bpo_subobjs != 0) {
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_comp, comp, sizeof (comp));
|
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_uncomp, uncomp, sizeof (uncomp));
|
2019-07-26 20:54:14 +03:00
|
|
|
if (bpo->bpo_havefreed) {
|
|
|
|
(void) printf(" %*s: object %llu, %llu local "
|
|
|
|
"blkptrs, %llu freed, %llu subobjs in object %llu, "
|
|
|
|
"%s (%s/%s comp)\n",
|
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_freed,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_subobjs,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_subobjs,
|
|
|
|
bytes, comp, uncomp);
|
|
|
|
} else {
|
|
|
|
(void) printf(" %*s: object %llu, %llu local "
|
|
|
|
"blkptrs, %llu subobjs in object %llu, "
|
|
|
|
"%s (%s/%s comp)\n",
|
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_subobjs,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_subobjs,
|
|
|
|
bytes, comp, uncomp);
|
|
|
|
}
|
2013-07-05 23:37:16 +04:00
|
|
|
|
|
|
|
for (i = 0; i < bpo->bpo_phys->bpo_num_subobjs; i++) {
|
|
|
|
uint64_t subobj;
|
|
|
|
bpobj_t subbpo;
|
|
|
|
int error;
|
|
|
|
VERIFY0(dmu_read(bpo->bpo_os,
|
|
|
|
bpo->bpo_phys->bpo_subobjs,
|
|
|
|
i * sizeof (subobj), sizeof (subobj), &subobj, 0));
|
|
|
|
error = bpobj_open(&subbpo, bpo->bpo_os, subobj);
|
|
|
|
if (error != 0) {
|
|
|
|
(void) printf("ERROR %u while trying to open "
|
|
|
|
"subobj id %llu\n",
|
|
|
|
error, (u_longlong_t)subobj);
|
|
|
|
continue;
|
|
|
|
}
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&subbpo, "subobj", indent + 1);
|
2015-12-31 18:57:11 +03:00
|
|
|
bpobj_close(&subbpo);
|
2013-07-05 23:37:16 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2019-07-26 20:54:14 +03:00
|
|
|
if (bpo->bpo_havefreed) {
|
|
|
|
(void) printf(" %*s: object %llu, %llu blkptrs, "
|
|
|
|
"%llu freed, %s\n",
|
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_freed,
|
|
|
|
bytes);
|
|
|
|
} else {
|
|
|
|
(void) printf(" %*s: object %llu, %llu blkptrs, "
|
|
|
|
"%s\n",
|
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
|
|
|
bytes);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['d'] < 5)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
|
2013-07-05 23:37:16 +04:00
|
|
|
if (indent == 0) {
|
|
|
|
(void) bpobj_iterate_nofree(bpo, dump_bpobj_cb, NULL, NULL);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
static int
|
|
|
|
dump_bookmark(dsl_pool_t *dp, char *name, boolean_t print_redact,
|
|
|
|
boolean_t print_list)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
zfs_bookmark_phys_t prop;
|
|
|
|
objset_t *mos = dp->dp_spa->spa_meta_objset;
|
|
|
|
err = dsl_bookmark_lookup(dp, name, NULL, &prop);
|
|
|
|
|
|
|
|
if (err != 0) {
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\t#%s: ", strchr(name, '#') + 1);
|
|
|
|
(void) printf("{guid: %llx creation_txg: %llu creation_time: "
|
|
|
|
"%llu redaction_obj: %llu}\n", (u_longlong_t)prop.zbm_guid,
|
|
|
|
(u_longlong_t)prop.zbm_creation_txg,
|
|
|
|
(u_longlong_t)prop.zbm_creation_time,
|
|
|
|
(u_longlong_t)prop.zbm_redaction_obj);
|
|
|
|
|
|
|
|
IMPLY(print_list, print_redact);
|
|
|
|
if (!print_redact || prop.zbm_redaction_obj == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
redaction_list_t *rl;
|
|
|
|
VERIFY0(dsl_redaction_list_hold_obj(dp,
|
|
|
|
prop.zbm_redaction_obj, FTAG, &rl));
|
|
|
|
|
|
|
|
redaction_list_phys_t *rlp = rl->rl_phys;
|
|
|
|
(void) printf("\tRedacted:\n\t\tProgress: ");
|
|
|
|
if (rlp->rlp_last_object != UINT64_MAX ||
|
|
|
|
rlp->rlp_last_blkid != UINT64_MAX) {
|
|
|
|
(void) printf("%llu %llu (incomplete)\n",
|
|
|
|
(u_longlong_t)rlp->rlp_last_object,
|
|
|
|
(u_longlong_t)rlp->rlp_last_blkid);
|
|
|
|
} else {
|
|
|
|
(void) printf("complete\n");
|
|
|
|
}
|
|
|
|
(void) printf("\t\tSnapshots: [");
|
|
|
|
for (unsigned int i = 0; i < rlp->rlp_num_snaps; i++) {
|
|
|
|
if (i > 0)
|
|
|
|
(void) printf(", ");
|
|
|
|
(void) printf("%0llu",
|
|
|
|
(u_longlong_t)rlp->rlp_snaps[i]);
|
|
|
|
}
|
|
|
|
(void) printf("]\n\t\tLength: %llu\n",
|
|
|
|
(u_longlong_t)rlp->rlp_num_entries);
|
|
|
|
|
|
|
|
if (!print_list) {
|
|
|
|
dsl_redaction_list_rele(rl, FTAG);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (rlp->rlp_num_entries == 0) {
|
|
|
|
dsl_redaction_list_rele(rl, FTAG);
|
|
|
|
(void) printf("\t\tRedaction List: []\n\n");
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
redact_block_phys_t *rbp_buf;
|
|
|
|
uint64_t size;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
|
|
|
|
VERIFY0(dmu_object_info(mos, prop.zbm_redaction_obj, &doi));
|
|
|
|
size = doi.doi_max_offset;
|
|
|
|
rbp_buf = kmem_alloc(size, KM_SLEEP);
|
|
|
|
|
|
|
|
err = dmu_read(mos, prop.zbm_redaction_obj, 0, size,
|
|
|
|
rbp_buf, 0);
|
|
|
|
if (err != 0) {
|
|
|
|
dsl_redaction_list_rele(rl, FTAG);
|
|
|
|
kmem_free(rbp_buf, size);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\t\tRedaction List: [{object: %llx, offset: "
|
|
|
|
"%llx, blksz: %x, count: %llx}",
|
|
|
|
(u_longlong_t)rbp_buf[0].rbp_object,
|
|
|
|
(u_longlong_t)rbp_buf[0].rbp_blkid,
|
|
|
|
(uint_t)(redact_block_get_size(&rbp_buf[0])),
|
|
|
|
(u_longlong_t)redact_block_get_count(&rbp_buf[0]));
|
|
|
|
|
|
|
|
for (size_t i = 1; i < rlp->rlp_num_entries; i++) {
|
|
|
|
(void) printf(",\n\t\t{object: %llx, offset: %llx, "
|
|
|
|
"blksz: %x, count: %llx}",
|
|
|
|
(u_longlong_t)rbp_buf[i].rbp_object,
|
|
|
|
(u_longlong_t)rbp_buf[i].rbp_blkid,
|
|
|
|
(uint_t)(redact_block_get_size(&rbp_buf[i])),
|
|
|
|
(u_longlong_t)redact_block_get_count(&rbp_buf[i]));
|
|
|
|
}
|
|
|
|
dsl_redaction_list_rele(rl, FTAG);
|
|
|
|
kmem_free(rbp_buf, size);
|
|
|
|
(void) printf("]\n\n");
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_bookmarks(objset_t *os, int verbosity)
|
|
|
|
{
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
dsl_dataset_t *ds = dmu_objset_ds(os);
|
|
|
|
dsl_pool_t *dp = spa_get_dsl(os->os_spa);
|
|
|
|
objset_t *mos = os->os_spa->spa_meta_objset;
|
|
|
|
if (verbosity < 4)
|
|
|
|
return;
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, mos, ds->ds_bookmarks_obj);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
zap_cursor_advance(&zc)) {
|
|
|
|
char osname[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
char buf[ZFS_MAX_DATASET_NAME_LEN];
|
|
|
|
dmu_objset_name(os, osname);
|
2019-06-25 04:06:26 +03:00
|
|
|
VERIFY3S(0, <=, snprintf(buf, sizeof (buf), "%s#%s", osname,
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
attr.za_name));
|
|
|
|
(void) dump_bookmark(dp, buf, verbosity >= 5, verbosity >= 6);
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
static void
|
|
|
|
bpobj_count_refd(bpobj_t *bpo)
|
|
|
|
{
|
|
|
|
mos_obj_refd(bpo->bpo_object);
|
|
|
|
|
|
|
|
if (bpo->bpo_havesubobj && bpo->bpo_phys->bpo_subobjs != 0) {
|
|
|
|
mos_obj_refd(bpo->bpo_phys->bpo_subobjs);
|
|
|
|
for (uint64_t i = 0; i < bpo->bpo_phys->bpo_num_subobjs; i++) {
|
|
|
|
uint64_t subobj;
|
|
|
|
bpobj_t subbpo;
|
|
|
|
int error;
|
|
|
|
VERIFY0(dmu_read(bpo->bpo_os,
|
|
|
|
bpo->bpo_phys->bpo_subobjs,
|
|
|
|
i * sizeof (subobj), sizeof (subobj), &subobj, 0));
|
|
|
|
error = bpobj_open(&subbpo, bpo->bpo_os, subobj);
|
|
|
|
if (error != 0) {
|
|
|
|
(void) printf("ERROR %u while trying to open "
|
|
|
|
"subobj id %llu\n",
|
|
|
|
error, (u_longlong_t)subobj);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
bpobj_count_refd(&subbpo);
|
|
|
|
bpobj_close(&subbpo);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
static int
|
|
|
|
dsl_deadlist_entry_count_refd(void *arg, dsl_deadlist_entry_t *dle)
|
|
|
|
{
|
|
|
|
spa_t *spa = arg;
|
|
|
|
uint64_t empty_bpobj = spa->spa_dsl_pool->dp_empty_bpobj;
|
|
|
|
if (dle->dle_bpobj.bpo_object != empty_bpobj)
|
|
|
|
bpobj_count_refd(&dle->dle_bpobj);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
dsl_deadlist_entry_dump(void *arg, dsl_deadlist_entry_t *dle)
|
|
|
|
{
|
|
|
|
ASSERT(arg == NULL);
|
|
|
|
if (dump_opt['d'] >= 5) {
|
|
|
|
char buf[128];
|
|
|
|
(void) snprintf(buf, sizeof (buf),
|
|
|
|
"mintxg %llu -> obj %llu",
|
|
|
|
(longlong_t)dle->dle_mintxg,
|
|
|
|
(longlong_t)dle->dle_bpobj.bpo_object);
|
|
|
|
|
|
|
|
dump_full_bpobj(&dle->dle_bpobj, buf, 0);
|
|
|
|
} else {
|
|
|
|
(void) printf("mintxg %llu -> obj %llu\n",
|
|
|
|
(longlong_t)dle->dle_mintxg,
|
|
|
|
(longlong_t)dle->dle_bpobj.bpo_object);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_blkptr_list(dsl_deadlist_t *dl, char *name)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
char bytes[32];
|
|
|
|
char comp[32];
|
|
|
|
char uncomp[32];
|
2019-07-26 20:54:14 +03:00
|
|
|
char entries[32];
|
|
|
|
spa_t *spa = dmu_objset_spa(dl->dl_os);
|
|
|
|
uint64_t empty_bpobj = spa->spa_dsl_pool->dp_empty_bpobj;
|
2018-10-12 21:28:26 +03:00
|
|
|
|
|
|
|
if (dl->dl_oldfmt) {
|
|
|
|
if (dl->dl_bpobj.bpo_object != empty_bpobj)
|
|
|
|
bpobj_count_refd(&dl->dl_bpobj);
|
|
|
|
} else {
|
|
|
|
mos_obj_refd(dl->dl_object);
|
2019-07-26 20:54:14 +03:00
|
|
|
dsl_deadlist_iterate(dl, dsl_deadlist_entry_count_refd, spa);
|
2018-10-12 21:28:26 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);
|
2019-07-26 20:54:14 +03:00
|
|
|
CTASSERT(sizeof (entries) >= NN_NUMBUF_SZ);
|
2017-06-13 12:16:45 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-09-17 11:14:39 +04:00
|
|
|
if (dl->dl_oldfmt) {
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&dl->dl_bpobj, "old-format deadlist", 0);
|
2014-09-17 11:14:39 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dl->dl_phys->dl_used, bytes, sizeof (bytes));
|
|
|
|
zdb_nicenum(dl->dl_phys->dl_comp, comp, sizeof (comp));
|
|
|
|
zdb_nicenum(dl->dl_phys->dl_uncomp, uncomp, sizeof (uncomp));
|
2019-07-26 20:54:14 +03:00
|
|
|
zdb_nicenum(avl_numnodes(&dl->dl_tree), entries, sizeof (entries));
|
|
|
|
(void) printf("\n %s: %s (%s/%s comp), %s entries\n",
|
|
|
|
name, bytes, comp, uncomp, entries);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (dump_opt['d'] < 4)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
dsl_deadlist_iterate(dl, dsl_deadlist_entry_dump, NULL);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
static int
|
|
|
|
verify_dd_livelist(objset_t *os)
|
|
|
|
{
|
|
|
|
uint64_t ll_used, used, ll_comp, comp, ll_uncomp, uncomp;
|
|
|
|
dsl_pool_t *dp = spa_get_dsl(os->os_spa);
|
|
|
|
dsl_dir_t *dd = os->os_dsl_dataset->ds_dir;
|
|
|
|
|
|
|
|
ASSERT(!dmu_objset_is_snapshot(os));
|
|
|
|
if (!dsl_deadlist_is_open(&dd->dd_livelist))
|
|
|
|
return (0);
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
|
|
|
dsl_deadlist_space(&dd->dd_livelist, &ll_used,
|
|
|
|
&ll_comp, &ll_uncomp);
|
|
|
|
|
|
|
|
dsl_dataset_t *origin_ds;
|
|
|
|
ASSERT(dsl_pool_config_held(dp));
|
|
|
|
VERIFY0(dsl_dataset_hold_obj(dp,
|
|
|
|
dsl_dir_phys(dd)->dd_origin_obj, FTAG, &origin_ds));
|
|
|
|
VERIFY0(dsl_dataset_space_written(origin_ds, os->os_dsl_dataset,
|
|
|
|
&used, &comp, &uncomp));
|
|
|
|
dsl_dataset_rele(origin_ds, FTAG);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
/*
|
|
|
|
* It's possible that the dataset's uncomp space is larger than the
|
|
|
|
* livelist's because livelists do not track embedded block pointers
|
|
|
|
*/
|
|
|
|
if (used != ll_used || comp != ll_comp || uncomp < ll_uncomp) {
|
|
|
|
char nice_used[32], nice_comp[32], nice_uncomp[32];
|
|
|
|
(void) printf("Discrepancy in space accounting:\n");
|
|
|
|
zdb_nicenum(used, nice_used, sizeof (nice_used));
|
|
|
|
zdb_nicenum(comp, nice_comp, sizeof (nice_comp));
|
|
|
|
zdb_nicenum(uncomp, nice_uncomp, sizeof (nice_uncomp));
|
|
|
|
(void) printf("dir: used %s, comp %s, uncomp %s\n",
|
|
|
|
nice_used, nice_comp, nice_uncomp);
|
|
|
|
zdb_nicenum(ll_used, nice_used, sizeof (nice_used));
|
|
|
|
zdb_nicenum(ll_comp, nice_comp, sizeof (nice_comp));
|
|
|
|
zdb_nicenum(ll_uncomp, nice_uncomp, sizeof (nice_uncomp));
|
|
|
|
(void) printf("livelist: used %s, comp %s, uncomp %s\n",
|
|
|
|
nice_used, nice_comp, nice_uncomp);
|
|
|
|
return (1);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2019-07-26 20:54:14 +03:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static avl_tree_t idx_tree;
|
|
|
|
static avl_tree_t domain_tree;
|
|
|
|
static boolean_t fuid_table_loaded;
|
2017-04-13 19:40:56 +03:00
|
|
|
static objset_t *sa_os = NULL;
|
|
|
|
static sa_attr_type_t *sa_attr_table = NULL;
|
|
|
|
|
|
|
|
static int
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
open_objset(const char *path, void *tag, objset_t **osp)
|
2017-04-13 19:40:56 +03:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
uint64_t sa_attrs = 0;
|
|
|
|
uint64_t version = 0;
|
|
|
|
|
|
|
|
VERIFY3P(sa_os, ==, NULL);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
/*
|
|
|
|
* We can't own an objset if it's redacted. Therefore, we do this
|
|
|
|
* dance: hold the objset, then acquire a long hold on its dataset, then
|
|
|
|
* release the pool (which is held as part of holding the objset).
|
|
|
|
*/
|
|
|
|
err = dmu_objset_hold(path, tag, osp);
|
2017-04-13 19:40:56 +03:00
|
|
|
if (err != 0) {
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
(void) fprintf(stderr, "failed to hold dataset '%s': %s\n",
|
|
|
|
path, strerror(err));
|
2017-04-13 19:40:56 +03:00
|
|
|
return (err);
|
|
|
|
}
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
dsl_dataset_long_hold(dmu_objset_ds(*osp), tag);
|
|
|
|
dsl_pool_rele(dmu_objset_pool(*osp), tag);
|
2017-04-13 19:40:56 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (dmu_objset_type(*osp) == DMU_OST_ZFS && !(*osp)->os_encrypted) {
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) zap_lookup(*osp, MASTER_NODE_OBJ, ZPL_VERSION_STR,
|
|
|
|
8, 1, &version);
|
|
|
|
if (version >= ZPL_VERSION_SA) {
|
|
|
|
(void) zap_lookup(*osp, MASTER_NODE_OBJ, ZFS_SA_ATTRS,
|
|
|
|
8, 1, &sa_attrs);
|
|
|
|
}
|
|
|
|
err = sa_setup(*osp, sa_attrs, zfs_attr_table, ZPL_END,
|
|
|
|
&sa_attr_table);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr, "sa_setup failed: %s\n",
|
|
|
|
strerror(err));
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(*osp), tag);
|
|
|
|
dsl_dataset_rele(dmu_objset_ds(*osp), tag);
|
2017-04-13 19:40:56 +03:00
|
|
|
*osp = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sa_os = *osp;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
close_objset(objset_t *os, void *tag)
|
|
|
|
{
|
|
|
|
VERIFY3P(os, ==, sa_os);
|
|
|
|
if (os->os_sa != NULL)
|
|
|
|
sa_tear_down(os);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(os), tag);
|
|
|
|
dsl_dataset_rele(dmu_objset_ds(os), tag);
|
2017-04-13 19:40:56 +03:00
|
|
|
sa_attr_table = NULL;
|
|
|
|
sa_os = NULL;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
2010-08-26 20:52:41 +04:00
|
|
|
fuid_table_destroy(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
if (fuid_table_loaded) {
|
|
|
|
zfs_fuid_table_destroy(&idx_tree, &domain_tree);
|
|
|
|
fuid_table_loaded = B_FALSE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* print uid or gid information.
|
|
|
|
* For normal POSIX id just the id is printed in decimal format.
|
|
|
|
* For CIFS files with FUID the fuid is printed in hex followed by
|
2013-07-05 23:37:16 +04:00
|
|
|
* the domain-rid string.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
print_idstr(uint64_t id, const char *id_type)
|
|
|
|
{
|
|
|
|
if (FUID_INDEX(id)) {
|
|
|
|
char *domain;
|
|
|
|
|
|
|
|
domain = zfs_fuid_idx_domain(&idx_tree, FUID_INDEX(id));
|
|
|
|
(void) printf("\t%s %llx [%s-%d]\n", id_type,
|
|
|
|
(u_longlong_t)id, domain, (int)FUID_RID(id));
|
|
|
|
} else {
|
|
|
|
(void) printf("\t%s %llu\n", id_type, (u_longlong_t)id);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uidgid(objset_t *os, uint64_t uid, uint64_t gid)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint32_t uid_idx, gid_idx;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
uid_idx = FUID_INDEX(uid);
|
|
|
|
gid_idx = FUID_INDEX(gid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* Load domain table, if not already loaded */
|
|
|
|
if (!fuid_table_loaded && (uid_idx || gid_idx)) {
|
|
|
|
uint64_t fuid_obj;
|
|
|
|
|
|
|
|
/* first find the fuid object. It lives in the master node */
|
|
|
|
VERIFY(zap_lookup(os, MASTER_NODE_OBJ, ZFS_FUID_TABLES,
|
|
|
|
8, 1, &fuid_obj) == 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
zfs_fuid_avl_tree_create(&idx_tree, &domain_tree);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) zfs_fuid_table_load(os, fuid_obj,
|
|
|
|
&idx_tree, &domain_tree);
|
|
|
|
fuid_table_loaded = B_TRUE;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
print_idstr(uid, "uid");
|
|
|
|
print_idstr(gid, "gid");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-07-09 16:15:26 +04:00
|
|
|
static void
|
|
|
|
dump_znode_sa_xattr(sa_handle_t *hdl)
|
|
|
|
{
|
|
|
|
nvlist_t *sa_xattr;
|
|
|
|
nvpair_t *elem = NULL;
|
|
|
|
int sa_xattr_size = 0;
|
|
|
|
int sa_xattr_entries = 0;
|
|
|
|
int error;
|
|
|
|
char *sa_xattr_packed;
|
|
|
|
|
|
|
|
error = sa_size(hdl, sa_attr_table[ZPL_DXATTR], &sa_xattr_size);
|
|
|
|
if (error || sa_xattr_size == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
sa_xattr_packed = malloc(sa_xattr_size);
|
|
|
|
if (sa_xattr_packed == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
error = sa_lookup(hdl, sa_attr_table[ZPL_DXATTR],
|
|
|
|
sa_xattr_packed, sa_xattr_size);
|
|
|
|
if (error) {
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = nvlist_unpack(sa_xattr_packed, sa_xattr_size, &sa_xattr, 0);
|
|
|
|
if (error) {
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
while ((elem = nvlist_next_nvpair(sa_xattr, elem)) != NULL)
|
|
|
|
sa_xattr_entries++;
|
|
|
|
|
|
|
|
(void) printf("\tSA xattrs: %d bytes, %d entries\n\n",
|
|
|
|
sa_xattr_size, sa_xattr_entries);
|
|
|
|
while ((elem = nvlist_next_nvpair(sa_xattr, elem)) != NULL) {
|
|
|
|
uchar_t *value;
|
|
|
|
uint_t cnt, idx;
|
|
|
|
|
|
|
|
(void) printf("\t\t%s = ", nvpair_name(elem));
|
|
|
|
nvpair_value_byte_array(elem, &value, &cnt);
|
2013-11-01 23:26:11 +04:00
|
|
|
for (idx = 0; idx < cnt; ++idx) {
|
2013-07-09 16:15:26 +04:00
|
|
|
if (isprint(value[idx]))
|
|
|
|
(void) putchar(value[idx]);
|
|
|
|
else
|
|
|
|
(void) printf("\\%3.3o", value[idx]);
|
|
|
|
}
|
|
|
|
(void) putchar('\n');
|
|
|
|
}
|
|
|
|
|
|
|
|
nvlist_free(sa_xattr);
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_znode(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
char path[MAXPATHLEN * 2]; /* allow for xattr and failure prefix */
|
2010-05-29 00:45:14 +04:00
|
|
|
sa_handle_t *hdl;
|
|
|
|
uint64_t xattr, rdev, gen;
|
|
|
|
uint64_t uid, gid, mode, fsize, parent, links;
|
|
|
|
uint64_t pflags;
|
|
|
|
uint64_t acctm[2], modtm[2], chgtm[2], crtm[2];
|
|
|
|
time_t z_crtime, z_atime, z_mtime, z_ctime;
|
|
|
|
sa_bulk_attr_t bulk[12];
|
|
|
|
int idx = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
VERIFY3P(os, ==, sa_os);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (sa_handle_get(os, object, NULL, SA_HDL_PRIVATE, &hdl)) {
|
|
|
|
(void) printf("Failed to get handle for SA znode\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_UID], NULL, &uid, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_GID], NULL, &gid, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_LINKS], NULL,
|
|
|
|
&links, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_GEN], NULL, &gen, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_MODE], NULL,
|
|
|
|
&mode, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_PARENT],
|
|
|
|
NULL, &parent, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_SIZE], NULL,
|
|
|
|
&fsize, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_ATIME], NULL,
|
|
|
|
acctm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_MTIME], NULL,
|
|
|
|
modtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_CRTIME], NULL,
|
|
|
|
crtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_CTIME], NULL,
|
|
|
|
chgtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_FLAGS], NULL,
|
|
|
|
&pflags, 8);
|
|
|
|
|
|
|
|
if (sa_bulk_lookup(hdl, bulk, idx)) {
|
|
|
|
(void) sa_handle_destroy(hdl);
|
|
|
|
return;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
z_crtime = (time_t)crtm[0];
|
|
|
|
z_atime = (time_t)acctm[0];
|
|
|
|
z_mtime = (time_t)modtm[0];
|
|
|
|
z_ctime = (time_t)chgtm[0];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-14 00:22:32 +03:00
|
|
|
if (dump_opt['d'] > 4) {
|
|
|
|
error = zfs_obj_to_path(os, object, path, sizeof (path));
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
if (error == ESTALE) {
|
|
|
|
(void) snprintf(path, sizeof (path), "on delete queue");
|
|
|
|
} else if (error != 0) {
|
|
|
|
leaked_objects++;
|
2017-04-14 00:22:32 +03:00
|
|
|
(void) snprintf(path, sizeof (path),
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
"path not found, possibly leaked");
|
2017-04-14 00:22:32 +03:00
|
|
|
}
|
|
|
|
(void) printf("\tpath %s\n", path);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uidgid(os, uid, gid);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\tatime %s", ctime(&z_atime));
|
|
|
|
(void) printf("\tmtime %s", ctime(&z_mtime));
|
|
|
|
(void) printf("\tctime %s", ctime(&z_ctime));
|
|
|
|
(void) printf("\tcrtime %s", ctime(&z_crtime));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\tgen %llu\n", (u_longlong_t)gen);
|
|
|
|
(void) printf("\tmode %llo\n", (u_longlong_t)mode);
|
|
|
|
(void) printf("\tsize %llu\n", (u_longlong_t)fsize);
|
|
|
|
(void) printf("\tparent %llu\n", (u_longlong_t)parent);
|
|
|
|
(void) printf("\tlinks %llu\n", (u_longlong_t)links);
|
|
|
|
(void) printf("\tpflags %llx\n", (u_longlong_t)pflags);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os) && (pflags & ZFS_PROJID)) {
|
|
|
|
uint64_t projid;
|
|
|
|
|
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_PROJID], &projid,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\tprojid %llu\n", (u_longlong_t)projid);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_XATTR], &xattr,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\txattr %llu\n", (u_longlong_t)xattr);
|
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_RDEV], &rdev,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\trdev 0x%016llx\n", (u_longlong_t)rdev);
|
2013-07-09 16:15:26 +04:00
|
|
|
dump_znode_sa_xattr(hdl);
|
2010-05-29 00:45:14 +04:00
|
|
|
sa_handle_destroy(hdl);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_acl(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dmu_objset(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static object_viewer_t *object_viewer[DMU_OT_NUMTYPES + 1] = {
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_none, /* unallocated */
|
|
|
|
dump_zap, /* object directory */
|
|
|
|
dump_uint64, /* object array */
|
|
|
|
dump_none, /* packed nvlist */
|
|
|
|
dump_packed_nvlist, /* packed nvlist size */
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_none, /* bpobj */
|
|
|
|
dump_bpobj, /* bpobj header */
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_none, /* SPA space map header */
|
|
|
|
dump_none, /* SPA space map */
|
|
|
|
dump_none, /* ZIL intent log */
|
|
|
|
dump_dnode, /* DMU dnode */
|
|
|
|
dump_dmu_objset, /* DMU objset */
|
|
|
|
dump_dsl_dir, /* DSL directory */
|
|
|
|
dump_zap, /* DSL directory child map */
|
|
|
|
dump_zap, /* DSL dataset snap map */
|
|
|
|
dump_zap, /* DSL props */
|
|
|
|
dump_dsl_dataset, /* DSL dataset */
|
|
|
|
dump_znode, /* ZFS znode */
|
|
|
|
dump_acl, /* ZFS V0 ACL */
|
|
|
|
dump_uint8, /* ZFS plain file */
|
|
|
|
dump_zpldir, /* ZFS directory */
|
|
|
|
dump_zap, /* ZFS master node */
|
|
|
|
dump_zap, /* ZFS delete queue */
|
|
|
|
dump_uint8, /* zvol object */
|
|
|
|
dump_zap, /* zvol prop */
|
|
|
|
dump_uint8, /* other uint8[] */
|
|
|
|
dump_uint64, /* other uint64[] */
|
|
|
|
dump_zap, /* other ZAP */
|
|
|
|
dump_zap, /* persistent error log */
|
|
|
|
dump_uint8, /* SPA history */
|
2013-08-28 15:45:09 +04:00
|
|
|
dump_history_offsets, /* SPA history offsets */
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_zap, /* Pool properties */
|
|
|
|
dump_zap, /* DSL permissions */
|
|
|
|
dump_acl, /* ZFS ACL */
|
|
|
|
dump_uint8, /* ZFS SYSACL */
|
|
|
|
dump_none, /* FUID nvlist */
|
|
|
|
dump_packed_nvlist, /* FUID nvlist size */
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_zap, /* DSL dataset next clones */
|
|
|
|
dump_zap, /* DSL scrub queue */
|
2018-02-14 01:54:54 +03:00
|
|
|
dump_zap, /* ZFS user/group/project used */
|
|
|
|
dump_zap, /* ZFS user/group/project quota */
|
2009-08-18 22:43:27 +04:00
|
|
|
dump_zap, /* snapshot refcount tags */
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_ddt_zap, /* DDT ZAP object */
|
|
|
|
dump_zap, /* DDT statistics */
|
|
|
|
dump_znode, /* SA object */
|
|
|
|
dump_zap, /* SA Master Node */
|
|
|
|
dump_sa_attrs, /* SA attribute registration */
|
|
|
|
dump_sa_layouts, /* SA attribute layouts */
|
|
|
|
dump_zap, /* DSL scrub translations */
|
|
|
|
dump_none, /* fake dedup BP */
|
|
|
|
dump_zap, /* deadlist */
|
|
|
|
dump_none, /* deadlist hdr */
|
|
|
|
dump_zap, /* dsl clones */
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_bpobj_subobjs, /* bpobj subobjs */
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_unknown, /* Unknown type, must be last */
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
static void
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
dump_object(objset_t *os, uint64_t object, int verbosity,
|
|
|
|
boolean_t *print_header, uint64_t *dnode_slots_used)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dmu_buf_t *db = NULL;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dnode_t *dn;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t dnode_held = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
void *bonus = NULL;
|
|
|
|
size_t bsize = 0;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
char iblk[32], dblk[32], lsize[32], asize[32], fill[32], dnsize[32];
|
2010-05-29 00:45:14 +04:00
|
|
|
char bonus_size[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
char aux[50];
|
|
|
|
int error;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (iblk) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (dblk) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (lsize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (asize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (bonus_size) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (*print_header) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("\n%10s %3s %5s %5s %5s %6s %5s %6s %s\n",
|
|
|
|
"Object", "lvl", "iblk", "dblk", "dsize", "dnsize",
|
|
|
|
"lsize", "%full", "type");
|
2008-11-20 23:01:55 +03:00
|
|
|
*print_header = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (object == 0) {
|
2010-08-27 01:24:34 +04:00
|
|
|
dn = DMU_META_DNODE(os);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_object_info_from_dnode(dn, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Encrypted datasets will have sensitive bonus buffers
|
|
|
|
* encrypted. Therefore we cannot hold the bonus buffer and
|
|
|
|
* must hold the dnode itself instead.
|
|
|
|
*/
|
|
|
|
error = dmu_object_info(os, object, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
fatal("dmu_object_info() failed, errno %u", error);
|
|
|
|
|
|
|
|
if (os->os_encrypted &&
|
|
|
|
DMU_OT_IS_ENCRYPTED(doi.doi_bonus_type)) {
|
|
|
|
error = dnode_hold(os, object, FTAG, &dn);
|
|
|
|
if (error)
|
|
|
|
fatal("dnode_hold() failed, errno %u", error);
|
|
|
|
dnode_held = B_TRUE;
|
|
|
|
} else {
|
|
|
|
error = dmu_bonus_hold(os, object, FTAG, &db);
|
|
|
|
if (error)
|
|
|
|
fatal("dmu_bonus_hold(%llu) failed, errno %u",
|
|
|
|
object, error);
|
|
|
|
bonus = db->db_data;
|
|
|
|
bsize = db->db_size;
|
|
|
|
dn = DB_DNODE((dmu_buf_impl_t *)db);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
if (dnode_slots_used)
|
|
|
|
*dnode_slots_used = doi.doi_dnodesize / DNODE_MIN_SIZE;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(doi.doi_metadata_block_size, iblk, sizeof (iblk));
|
|
|
|
zdb_nicenum(doi.doi_data_block_size, dblk, sizeof (dblk));
|
|
|
|
zdb_nicenum(doi.doi_max_offset, lsize, sizeof (lsize));
|
|
|
|
zdb_nicenum(doi.doi_physical_blocks_512 << 9, asize, sizeof (asize));
|
|
|
|
zdb_nicenum(doi.doi_bonus_size, bonus_size, sizeof (bonus_size));
|
|
|
|
zdb_nicenum(doi.doi_dnodesize, dnsize, sizeof (dnsize));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) sprintf(fill, "%6.2f", 100.0 * doi.doi_fill_count *
|
|
|
|
doi.doi_data_block_size / (object == 0 ? DNODES_PER_BLOCK : 1) /
|
|
|
|
doi.doi_max_offset);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
aux[0] = '\0';
|
|
|
|
|
|
|
|
if (doi.doi_checksum != ZIO_CHECKSUM_INHERIT || verbosity >= 6) {
|
2017-10-17 01:32:48 +03:00
|
|
|
(void) snprintf(aux + strlen(aux), sizeof (aux) - strlen(aux),
|
|
|
|
" (K=%s)", ZDB_CHECKSUM_NAME(doi.doi_checksum));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (doi.doi_compress != ZIO_COMPRESS_INHERIT || verbosity >= 6) {
|
2017-10-17 01:32:48 +03:00
|
|
|
(void) snprintf(aux + strlen(aux), sizeof (aux) - strlen(aux),
|
|
|
|
" (Z=%s)", ZDB_COMPRESS_NAME(doi.doi_compress));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("%10lld %3u %5s %5s %5s %6s %5s %6s %s%s\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)object, doi.doi_indirection, iblk, dblk,
|
2016-08-01 20:42:04 +03:00
|
|
|
asize, dnsize, lsize, fill, zdb_ot_name(doi.doi_type), aux);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (doi.doi_bonus_type != DMU_OT_NONE && verbosity > 3) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("%10s %3s %5s %5s %5s %5s %5s %6s %s\n",
|
|
|
|
"", "", "", "", "", "", bonus_size, "bonus",
|
2016-08-01 20:42:04 +03:00
|
|
|
zdb_ot_name(doi.doi_bonus_type));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (verbosity >= 4) {
|
2016-10-04 21:46:10 +03:00
|
|
|
(void) printf("\tdnode flags: %s%s%s%s\n",
|
2009-07-03 02:44:48 +04:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) ?
|
|
|
|
"USED_BYTES " : "",
|
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USERUSED_ACCOUNTED) ?
|
2010-05-29 00:45:14 +04:00
|
|
|
"USERUSED_ACCOUNTED " : "",
|
2016-10-04 21:46:10 +03:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USEROBJUSED_ACCOUNTED) ?
|
|
|
|
"USEROBJUSED_ACCOUNTED " : "",
|
2010-05-29 00:45:14 +04:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR) ?
|
|
|
|
"SPILL_BLKPTR" : "");
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("\tdnode maxblkid: %llu\n",
|
|
|
|
(longlong_t)dn->dn_phys->dn_maxblkid);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (!dnode_held) {
|
|
|
|
object_viewer[ZDB_OT_TYPE(doi.doi_bonus_type)](os,
|
|
|
|
object, bonus, bsize);
|
|
|
|
} else {
|
|
|
|
(void) printf("\t\t(bonus encrypted)\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!os->os_encrypted || !DMU_OT_IS_ENCRYPTED(doi.doi_type)) {
|
|
|
|
object_viewer[ZDB_OT_TYPE(doi.doi_type)](os, object,
|
|
|
|
NULL, 0);
|
|
|
|
} else {
|
|
|
|
(void) printf("\t\t(object encrypted)\n");
|
|
|
|
}
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
*print_header = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (verbosity >= 5)
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_indirect(dn);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (verbosity >= 5) {
|
|
|
|
/*
|
|
|
|
* Report the list of segments that comprise the object.
|
|
|
|
*/
|
|
|
|
uint64_t start = 0;
|
|
|
|
uint64_t end;
|
|
|
|
uint64_t blkfill = 1;
|
|
|
|
int minlvl = 1;
|
|
|
|
|
|
|
|
if (dn->dn_type == DMU_OT_DNODE) {
|
|
|
|
minlvl = 0;
|
|
|
|
blkfill = DNODES_PER_BLOCK;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (;;) {
|
2010-05-29 00:45:14 +04:00
|
|
|
char segsize[32];
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (segsize) >= NN_NUMBUF_SZ);
|
2008-12-03 23:09:06 +03:00
|
|
|
error = dnode_next_offset(dn,
|
|
|
|
0, &start, minlvl, blkfill, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
end = start;
|
2008-12-03 23:09:06 +03:00
|
|
|
error = dnode_next_offset(dn,
|
|
|
|
DNODE_FIND_HOLE, &end, minlvl, blkfill, 0);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(end - start, segsize, sizeof (segsize));
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tsegment [%016llx, %016llx)"
|
|
|
|
" size %5s\n", (u_longlong_t)start,
|
|
|
|
(u_longlong_t)end, segsize);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
start = end;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (db != NULL)
|
|
|
|
dmu_buf_rele(db, FTAG);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (dnode_held)
|
|
|
|
dnode_rele(dn, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
static void
|
|
|
|
count_dir_mos_objects(dsl_dir_t *dd)
|
|
|
|
{
|
|
|
|
mos_obj_refd(dd->dd_object);
|
|
|
|
mos_obj_refd(dsl_dir_phys(dd)->dd_child_dir_zapobj);
|
|
|
|
mos_obj_refd(dsl_dir_phys(dd)->dd_deleg_zapobj);
|
|
|
|
mos_obj_refd(dsl_dir_phys(dd)->dd_props_zapobj);
|
|
|
|
mos_obj_refd(dsl_dir_phys(dd)->dd_clones);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The dd_crypto_obj can be referenced by multiple dsl_dir's.
|
|
|
|
* Ignore the references after the first one.
|
|
|
|
*/
|
|
|
|
mos_obj_refd_multiple(dd->dd_crypto_obj);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
count_ds_mos_objects(dsl_dataset_t *ds)
|
|
|
|
{
|
|
|
|
mos_obj_refd(ds->ds_object);
|
|
|
|
mos_obj_refd(dsl_dataset_phys(ds)->ds_next_clones_obj);
|
|
|
|
mos_obj_refd(dsl_dataset_phys(ds)->ds_props_obj);
|
|
|
|
mos_obj_refd(dsl_dataset_phys(ds)->ds_userrefs_obj);
|
|
|
|
mos_obj_refd(dsl_dataset_phys(ds)->ds_snapnames_zapobj);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
mos_obj_refd(ds->ds_bookmarks_obj);
|
2018-10-12 21:28:26 +03:00
|
|
|
|
|
|
|
if (!dsl_dataset_is_snapshot(ds)) {
|
|
|
|
count_dir_mos_objects(ds->ds_dir);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char *objset_types[DMU_OST_NUMTYPES] = {
|
2008-11-20 23:01:55 +03:00
|
|
|
"NONE", "META", "ZPL", "ZVOL", "OTHER", "ANY" };
|
|
|
|
|
|
|
|
static void
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_objset(objset_t *os)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dmu_objset_stats_t dds;
|
|
|
|
uint64_t object, object_count;
|
|
|
|
uint64_t refdbytes, usedobjs, scratch;
|
2010-05-29 00:45:14 +04:00
|
|
|
char numbuf[32];
|
2009-07-03 02:44:48 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN + 20];
|
2016-06-16 00:28:36 +03:00
|
|
|
char osname[ZFS_MAX_DATASET_NAME_LEN];
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *type = "UNKNOWN";
|
2008-11-20 23:01:55 +03:00
|
|
|
int verbosity = dump_opt['d'];
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
boolean_t print_header;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i;
|
|
|
|
int error;
|
2017-09-06 02:15:04 +03:00
|
|
|
uint64_t total_slots_used = 0;
|
|
|
|
uint64_t max_slot_used = 0;
|
|
|
|
uint64_t dnode_slots;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (numbuf) >= NN_NUMBUF_SZ);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_objset_fast_stat(os, &dds);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
print_header = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dds.dds_type < DMU_OST_NUMTYPES)
|
|
|
|
type = objset_types[dds.dds_type];
|
|
|
|
|
|
|
|
if (dds.dds_type == DMU_OST_META) {
|
|
|
|
dds.dds_creation_txg = TXG_INITIAL;
|
2014-06-06 01:19:08 +04:00
|
|
|
usedobjs = BP_GET_FILL(os->os_rootbp);
|
2015-04-01 18:14:34 +03:00
|
|
|
refdbytes = dsl_dir_phys(os->os_spa->spa_dsl_pool->dp_mos_dir)->
|
|
|
|
dd_used_bytes;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
dmu_objset_space(os, &refdbytes, &scratch, &usedobjs, &scratch);
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(usedobjs, ==, BP_GET_FILL(os->os_rootbp));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(refdbytes, numbuf, sizeof (numbuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (verbosity >= 4) {
|
2013-12-09 22:37:51 +04:00
|
|
|
(void) snprintf(blkbuf, sizeof (blkbuf), ", rootbp ");
|
|
|
|
(void) snprintf_blkptr(blkbuf + strlen(blkbuf),
|
|
|
|
sizeof (blkbuf) - strlen(blkbuf), os->os_rootbp);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
blkbuf[0] = '\0';
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_objset_name(os, osname);
|
|
|
|
|
|
|
|
(void) printf("Dataset %s [%s], ID %llu, cr_txg %llu, "
|
2018-06-17 03:39:14 +03:00
|
|
|
"%s, %llu objects%s%s\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
osname, type, (u_longlong_t)dmu_objset_id(os),
|
|
|
|
(u_longlong_t)dds.dds_creation_txg,
|
2018-06-17 03:39:14 +03:00
|
|
|
numbuf, (u_longlong_t)usedobjs, blkbuf,
|
|
|
|
(dds.dds_inconsistent) ? " (inconsistent)" : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (zopt_objects != 0) {
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
for (i = 0; i < zopt_objects; i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_object(os, zopt_object[i], verbosity,
|
2017-09-06 02:15:04 +03:00
|
|
|
&print_header, NULL);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['i'] != 0 || verbosity >= 2)
|
|
|
|
dump_intent_log(dmu_objset_zil(os));
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (dmu_objset_ds(os) != NULL) {
|
|
|
|
dsl_dataset_t *ds = dmu_objset_ds(os);
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_blkptr_list(&ds->ds_deadlist, "Deadlist");
|
|
|
|
if (dsl_deadlist_is_open(&ds->ds_dir->dd_livelist) &&
|
|
|
|
!dmu_objset_is_snapshot(os)) {
|
|
|
|
dump_blkptr_list(&ds->ds_dir->dd_livelist, "Livelist");
|
|
|
|
if (verify_dd_livelist(os) != 0)
|
|
|
|
fatal("livelist is incorrect");
|
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
if (dsl_dataset_remap_deadlist_exists(ds)) {
|
|
|
|
(void) printf("ds_remap_deadlist:\n");
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_blkptr_list(&ds->ds_remap_deadlist, "Deadlist");
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2018-10-12 21:28:26 +03:00
|
|
|
count_ds_mos_objects(ds);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (dmu_objset_ds(os) != NULL)
|
|
|
|
dump_bookmarks(os, verbosity);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (verbosity < 2)
|
|
|
|
return;
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_IS_HOLE(os->os_rootbp))
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, 0, verbosity, &print_header, NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
object_count = 0;
|
2010-08-27 01:24:34 +04:00
|
|
|
if (DMU_USERUSED_DNODE(os) != NULL &&
|
|
|
|
DMU_USERUSED_DNODE(os)->dn_type != 0) {
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, DMU_USERUSED_OBJECT, verbosity, &print_header,
|
|
|
|
NULL);
|
|
|
|
dump_object(os, DMU_GROUPUSED_OBJECT, verbosity, &print_header,
|
|
|
|
NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (DMU_PROJECTUSED_DNODE(os) != NULL &&
|
|
|
|
DMU_PROJECTUSED_DNODE(os)->dn_type != 0)
|
|
|
|
dump_object(os, DMU_PROJECTUSED_OBJECT, verbosity,
|
|
|
|
&print_header, NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
object = 0;
|
|
|
|
while ((error = dmu_object_next(os, &object, B_FALSE, 0)) == 0) {
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, object, verbosity, &print_header, &dnode_slots);
|
2008-11-20 23:01:55 +03:00
|
|
|
object_count++;
|
2017-09-06 02:15:04 +03:00
|
|
|
total_slots_used += dnode_slots;
|
|
|
|
max_slot_used = object + dnode_slots - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
(void) printf(" Dnode slots:\n");
|
|
|
|
(void) printf("\tTotal used: %10llu\n",
|
|
|
|
(u_longlong_t)total_slots_used);
|
|
|
|
(void) printf("\tMax used: %10llu\n",
|
|
|
|
(u_longlong_t)max_slot_used);
|
|
|
|
(void) printf("\tPercent empty: %10lf\n",
|
|
|
|
(double)(max_slot_used - total_slots_used)*100 /
|
|
|
|
(double)max_slot_used);
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error != ESRCH) {
|
|
|
|
(void) fprintf(stderr, "dmu_object_next() = %d\n", error);
|
|
|
|
abort();
|
|
|
|
}
|
2018-12-05 20:30:28 +03:00
|
|
|
|
|
|
|
ASSERT3U(object_count, ==, usedobjs);
|
|
|
|
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
if (leaked_objects != 0) {
|
|
|
|
(void) printf("%d potentially leaked objects detected\n",
|
|
|
|
leaked_objects);
|
|
|
|
leaked_objects = 0;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uberblock(uberblock_t *ub, const char *header, const char *footer)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
time_t timestamp = ub->ub_timestamp;
|
|
|
|
|
2010-08-26 20:52:40 +04:00
|
|
|
(void) printf("%s", header ? header : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\tmagic = %016llx\n", (u_longlong_t)ub->ub_magic);
|
|
|
|
(void) printf("\tversion = %llu\n", (u_longlong_t)ub->ub_version);
|
|
|
|
(void) printf("\ttxg = %llu\n", (u_longlong_t)ub->ub_txg);
|
|
|
|
(void) printf("\tguid_sum = %llu\n", (u_longlong_t)ub->ub_guid_sum);
|
|
|
|
(void) printf("\ttimestamp = %llu UTC = %s",
|
|
|
|
(u_longlong_t)ub->ub_timestamp, asctime(localtime(×tamp)));
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
(void) printf("\tmmp_magic = %016llx\n",
|
|
|
|
(u_longlong_t)ub->ub_mmp_magic);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
if (MMP_VALID(ub)) {
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
(void) printf("\tmmp_delay = %0llu\n",
|
|
|
|
(u_longlong_t)ub->ub_mmp_delay);
|
MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 22:47:57 +03:00
|
|
|
if (MMP_SEQ_VALID(ub))
|
|
|
|
(void) printf("\tmmp_seq = %u\n",
|
|
|
|
(unsigned int) MMP_SEQ(ub));
|
|
|
|
if (MMP_FAIL_INT_VALID(ub))
|
|
|
|
(void) printf("\tmmp_fail = %u\n",
|
|
|
|
(unsigned int) MMP_FAIL_INT(ub));
|
|
|
|
if (MMP_INTERVAL_VALID(ub))
|
|
|
|
(void) printf("\tmmp_write = %u\n",
|
|
|
|
(unsigned int) MMP_INTERVAL(ub));
|
|
|
|
/* After MMP_* to make summarize_uberblock_mmp cleaner */
|
|
|
|
(void) printf("\tmmp_valid = %x\n",
|
|
|
|
(unsigned int) ub->ub_mmp_config & 0xFF);
|
|
|
|
}
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
if (dump_opt['u'] >= 4) {
|
2008-11-20 23:01:55 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &ub->ub_rootbp);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\trootbp = %s\n", blkbuf);
|
|
|
|
}
|
2016-12-17 01:11:29 +03:00
|
|
|
(void) printf("\tcheckpoint_txg = %llu\n",
|
|
|
|
(u_longlong_t)ub->ub_checkpoint_txg);
|
2010-08-26 20:52:40 +04:00
|
|
|
(void) printf("%s", footer ? footer : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_config(spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_t *db;
|
|
|
|
size_t nvsize = 0;
|
|
|
|
int error = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_bonus_hold(spa->spa_meta_objset,
|
|
|
|
spa->spa_config_object, FTAG, &db);
|
|
|
|
|
|
|
|
if (error == 0) {
|
|
|
|
nvsize = *(uint64_t *)db->db_data;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
(void) printf("\nMOS Configuration:\n");
|
|
|
|
dump_packed_nvlist(spa->spa_meta_objset,
|
|
|
|
spa->spa_config_object, (void *)&nvsize, 1);
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "dmu_bonus_hold(%llu) failed, errno %d",
|
|
|
|
(u_longlong_t)spa->spa_config_object, error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_cachefile(const char *cachefile)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
struct stat64 statbuf;
|
|
|
|
char *buf;
|
|
|
|
nvlist_t *config;
|
|
|
|
|
|
|
|
if ((fd = open64(cachefile, O_RDONLY)) < 0) {
|
|
|
|
(void) printf("cannot open '%s': %s\n", cachefile,
|
|
|
|
strerror(errno));
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fstat64(fd, &statbuf) != 0) {
|
|
|
|
(void) printf("failed to stat '%s': %s\n", cachefile,
|
|
|
|
strerror(errno));
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((buf = malloc(statbuf.st_size)) == NULL) {
|
|
|
|
(void) fprintf(stderr, "failed to allocate %llu bytes\n",
|
|
|
|
(u_longlong_t)statbuf.st_size);
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (read(fd, buf, statbuf.st_size) != statbuf.st_size) {
|
|
|
|
(void) fprintf(stderr, "failed to read %llu bytes\n",
|
|
|
|
(u_longlong_t)statbuf.st_size);
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) close(fd);
|
|
|
|
|
|
|
|
if (nvlist_unpack(buf, statbuf.st_size, &config, 0) != 0) {
|
|
|
|
(void) fprintf(stderr, "failed to unpack nvlist\n");
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
free(buf);
|
|
|
|
|
|
|
|
dump_nvlist(config, 0);
|
|
|
|
|
|
|
|
nvlist_free(config);
|
|
|
|
}
|
|
|
|
|
2017-02-03 01:03:48 +03:00
|
|
|
/*
|
|
|
|
* ZFS label nvlist stats
|
|
|
|
*/
|
|
|
|
typedef struct zdb_nvl_stats {
|
|
|
|
int zns_list_count;
|
|
|
|
int zns_leaf_count;
|
|
|
|
size_t zns_leaf_largest;
|
|
|
|
size_t zns_leaf_total;
|
|
|
|
nvlist_t *zns_string;
|
|
|
|
nvlist_t *zns_uint64;
|
|
|
|
nvlist_t *zns_boolean;
|
|
|
|
} zdb_nvl_stats_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
collect_nvlist_stats(nvlist_t *nvl, zdb_nvl_stats_t *stats)
|
|
|
|
{
|
|
|
|
nvlist_t *list, **array;
|
|
|
|
nvpair_t *nvp = NULL;
|
|
|
|
char *name;
|
|
|
|
uint_t i, items;
|
|
|
|
|
|
|
|
stats->zns_list_count++;
|
|
|
|
|
|
|
|
while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) {
|
|
|
|
name = nvpair_name(nvp);
|
|
|
|
|
|
|
|
switch (nvpair_type(nvp)) {
|
|
|
|
case DATA_TYPE_STRING:
|
|
|
|
fnvlist_add_string(stats->zns_string, name,
|
|
|
|
fnvpair_value_string(nvp));
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_UINT64:
|
|
|
|
fnvlist_add_uint64(stats->zns_uint64, name,
|
|
|
|
fnvpair_value_uint64(nvp));
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_BOOLEAN:
|
|
|
|
fnvlist_add_boolean(stats->zns_boolean, name);
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_NVLIST:
|
|
|
|
if (nvpair_value_nvlist(nvp, &list) == 0)
|
|
|
|
collect_nvlist_stats(list, stats);
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_NVLIST_ARRAY:
|
|
|
|
if (nvpair_value_nvlist_array(nvp, &array, &items) != 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
for (i = 0; i < items; i++) {
|
|
|
|
collect_nvlist_stats(array[i], stats);
|
|
|
|
|
|
|
|
/* collect stats on leaf vdev */
|
|
|
|
if (strcmp(name, "children") == 0) {
|
|
|
|
size_t size;
|
|
|
|
|
|
|
|
(void) nvlist_size(array[i], &size,
|
|
|
|
NV_ENCODE_XDR);
|
|
|
|
stats->zns_leaf_total += size;
|
|
|
|
if (size > stats->zns_leaf_largest)
|
|
|
|
stats->zns_leaf_largest = size;
|
|
|
|
stats->zns_leaf_count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
(void) printf("skip type %d!\n", (int)nvpair_type(nvp));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_nvlist_stats(nvlist_t *nvl, size_t cap)
|
|
|
|
{
|
|
|
|
zdb_nvl_stats_t stats = { 0 };
|
|
|
|
size_t size, sum = 0, total;
|
2017-02-07 20:29:47 +03:00
|
|
|
size_t noise;
|
2017-02-03 01:03:48 +03:00
|
|
|
|
|
|
|
/* requires nvlist with non-unique names for stat collection */
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_string, 0, 0));
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_uint64, 0, 0));
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_boolean, 0, 0));
|
|
|
|
VERIFY0(nvlist_size(stats.zns_boolean, &noise, NV_ENCODE_XDR));
|
|
|
|
|
|
|
|
(void) printf("\n\nZFS Label NVList Config Stats:\n");
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(nvl, &total, NV_ENCODE_XDR));
|
|
|
|
(void) printf(" %d bytes used, %d bytes free (using %4.1f%%)\n\n",
|
|
|
|
(int)total, (int)(cap - total), 100.0 * total / cap);
|
|
|
|
|
|
|
|
collect_nvlist_stats(nvl, &stats);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_uint64, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "integers:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_uint64),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_string, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "strings:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_string),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_boolean, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "booleans:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_boolean),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
size = total - sum; /* treat remainder as nvlist overhead */
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n\n", "nvlists:",
|
|
|
|
stats.zns_list_count, (int)size, 100.0 * size / total);
|
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
if (stats.zns_leaf_count > 0) {
|
|
|
|
size_t average = stats.zns_leaf_total / stats.zns_leaf_count;
|
2017-02-03 01:03:48 +03:00
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
(void) printf("%12s %4d %6d bytes average\n", "leaf vdevs:",
|
|
|
|
stats.zns_leaf_count, (int)average);
|
|
|
|
(void) printf("%24d bytes largest\n",
|
|
|
|
(int)stats.zns_leaf_largest);
|
2017-02-03 01:03:48 +03:00
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
if (dump_opt['l'] >= 3 && average > 0)
|
|
|
|
(void) printf(" space for %d additional leaf vdevs\n",
|
|
|
|
(int)((cap - total) / average));
|
|
|
|
}
|
2017-02-03 01:03:48 +03:00
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
nvlist_free(stats.zns_string);
|
|
|
|
nvlist_free(stats.zns_uint64);
|
|
|
|
nvlist_free(stats.zns_boolean);
|
|
|
|
}
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
typedef struct cksum_record {
|
|
|
|
zio_cksum_t cksum;
|
|
|
|
boolean_t labels[VDEV_LABELS];
|
|
|
|
avl_node_t link;
|
|
|
|
} cksum_record_t;
|
|
|
|
|
2017-02-04 01:18:28 +03:00
|
|
|
static int
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
cksum_record_compare(const void *x1, const void *x2)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
const cksum_record_t *l = (cksum_record_t *)x1;
|
|
|
|
const cksum_record_t *r = (cksum_record_t *)x2;
|
|
|
|
int arraysize = ARRAY_SIZE(l->cksum.zc_word);
|
|
|
|
int difference;
|
|
|
|
|
|
|
|
for (int i = 0; i < arraysize; i++) {
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
difference = TREE_CMP(l->cksum.zc_word[i], r->cksum.zc_word[i]);
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
if (difference)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (difference);
|
|
|
|
}
|
|
|
|
|
|
|
|
static cksum_record_t *
|
|
|
|
cksum_record_alloc(zio_cksum_t *cksum, int l)
|
|
|
|
{
|
|
|
|
cksum_record_t *rec;
|
|
|
|
|
|
|
|
rec = umem_zalloc(sizeof (*rec), UMEM_NOFAIL);
|
|
|
|
rec->cksum = *cksum;
|
|
|
|
rec->labels[l] = B_TRUE;
|
|
|
|
|
|
|
|
return (rec);
|
|
|
|
}
|
|
|
|
|
|
|
|
static cksum_record_t *
|
|
|
|
cksum_record_lookup(avl_tree_t *tree, zio_cksum_t *cksum)
|
|
|
|
{
|
|
|
|
cksum_record_t lookup = { .cksum = *cksum };
|
|
|
|
avl_index_t where;
|
|
|
|
|
|
|
|
return (avl_find(tree, &lookup, &where));
|
|
|
|
}
|
|
|
|
|
|
|
|
static cksum_record_t *
|
|
|
|
cksum_record_insert(avl_tree_t *tree, zio_cksum_t *cksum, int l)
|
|
|
|
{
|
|
|
|
cksum_record_t *rec;
|
|
|
|
|
|
|
|
rec = cksum_record_lookup(tree, cksum);
|
|
|
|
if (rec) {
|
|
|
|
rec->labels[l] = B_TRUE;
|
|
|
|
} else {
|
|
|
|
rec = cksum_record_alloc(cksum, l);
|
|
|
|
avl_add(tree, rec);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (rec);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
first_label(cksum_record_t *rec)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < VDEV_LABELS; i++)
|
|
|
|
if (rec->labels[i])
|
|
|
|
return (i);
|
|
|
|
|
|
|
|
return (-1);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
print_label_numbers(char *prefix, cksum_record_t *rec)
|
|
|
|
{
|
|
|
|
printf("%s", prefix);
|
|
|
|
for (int i = 0; i < VDEV_LABELS; i++)
|
|
|
|
if (rec->labels[i] == B_TRUE)
|
|
|
|
printf("%d ", i);
|
|
|
|
printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
#define MAX_UBERBLOCK_COUNT (VDEV_UBERBLOCK_RING >> UBERBLOCK_SHIFT)
|
|
|
|
|
2019-02-13 22:28:36 +03:00
|
|
|
typedef struct zdb_label {
|
2008-11-20 23:01:55 +03:00
|
|
|
vdev_label_t label;
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
nvlist_t *config_nv;
|
|
|
|
cksum_record_t *config;
|
|
|
|
cksum_record_t *uberblocks[MAX_UBERBLOCK_COUNT];
|
|
|
|
boolean_t header_printed;
|
|
|
|
boolean_t read_failed;
|
2019-02-13 22:28:36 +03:00
|
|
|
} zdb_label_t;
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
|
|
|
|
static void
|
2019-02-13 22:28:36 +03:00
|
|
|
print_label_header(zdb_label_t *label, int l)
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
{
|
|
|
|
|
|
|
|
if (dump_opt['q'])
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (label->header_printed == B_TRUE)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("------------------------------------\n");
|
|
|
|
(void) printf("LABEL %d\n", l);
|
|
|
|
(void) printf("------------------------------------\n");
|
|
|
|
|
|
|
|
label->header_printed = B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2019-02-13 22:28:36 +03:00
|
|
|
dump_config_from_label(zdb_label_t *label, size_t buflen, int l)
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
{
|
|
|
|
if (dump_opt['q'])
|
|
|
|
return;
|
|
|
|
|
|
|
|
if ((dump_opt['l'] < 3) && (first_label(label->config) != l))
|
|
|
|
return;
|
|
|
|
|
|
|
|
print_label_header(label, l);
|
|
|
|
dump_nvlist(label->config_nv, 4);
|
|
|
|
print_label_numbers(" labels = ", label->config);
|
|
|
|
|
|
|
|
if (dump_opt['l'] >= 2)
|
|
|
|
dump_nvlist_stats(label->config_nv, buflen);
|
|
|
|
}
|
|
|
|
|
|
|
|
#define ZDB_MAX_UB_HEADER_SIZE 32
|
|
|
|
|
|
|
|
static void
|
2019-02-13 22:28:36 +03:00
|
|
|
dump_label_uberblocks(zdb_label_t *label, uint64_t ashift, int label_num)
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
{
|
|
|
|
|
|
|
|
vdev_t vd;
|
|
|
|
char header[ZDB_MAX_UB_HEADER_SIZE];
|
|
|
|
|
|
|
|
vd.vdev_ashift = ashift;
|
|
|
|
vd.vdev_top = &vd;
|
|
|
|
|
|
|
|
for (int i = 0; i < VDEV_UBERBLOCK_COUNT(&vd); i++) {
|
|
|
|
uint64_t uoff = VDEV_UBERBLOCK_OFFSET(&vd, i);
|
|
|
|
uberblock_t *ub = (void *)((char *)&label->label + uoff);
|
|
|
|
cksum_record_t *rec = label->uberblocks[i];
|
|
|
|
|
|
|
|
if (rec == NULL) {
|
|
|
|
if (dump_opt['u'] >= 2) {
|
|
|
|
print_label_header(label, label_num);
|
|
|
|
(void) printf(" Uberblock[%d] invalid\n", i);
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((dump_opt['u'] < 3) && (first_label(rec) != label_num))
|
|
|
|
continue;
|
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if ((dump_opt['u'] < 4) &&
|
|
|
|
(ub->ub_mmp_magic == MMP_MAGIC) && ub->ub_mmp_delay &&
|
|
|
|
(i >= VDEV_UBERBLOCK_COUNT(&vd) - MMP_BLOCKS_PER_LABEL))
|
|
|
|
continue;
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
print_label_header(label, label_num);
|
|
|
|
(void) snprintf(header, ZDB_MAX_UB_HEADER_SIZE,
|
|
|
|
" Uberblock[%d]\n", i);
|
|
|
|
dump_uberblock(ub, header, "");
|
|
|
|
print_label_numbers(" labels = ", rec);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
static char curpath[PATH_MAX];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate through the path components, recursively passing
|
|
|
|
* current one's obj and remaining path until we find the obj
|
|
|
|
* for the last one.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
dump_path_impl(objset_t *os, uint64_t obj, char *name)
|
|
|
|
{
|
|
|
|
int err;
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
boolean_t header = B_TRUE;
|
2017-04-13 19:40:56 +03:00
|
|
|
uint64_t child_obj;
|
|
|
|
char *s;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
|
|
|
|
if ((s = strchr(name, '/')) != NULL)
|
|
|
|
*s = '\0';
|
|
|
|
err = zap_lookup(os, obj, name, 8, 1, &child_obj);
|
|
|
|
|
|
|
|
(void) strlcat(curpath, name, sizeof (curpath));
|
|
|
|
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr, "failed to lookup %s: %s\n",
|
|
|
|
curpath, strerror(err));
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
child_obj = ZFS_DIRENT_OBJ(child_obj);
|
|
|
|
err = sa_buf_hold(os, child_obj, FTAG, &db);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"failed to get SA dbuf for obj %llu: %s\n",
|
|
|
|
(u_longlong_t)child_obj, strerror(err));
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
dmu_object_info_from_db(db, &doi);
|
|
|
|
sa_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
if (doi.doi_bonus_type != DMU_OT_SA &&
|
|
|
|
doi.doi_bonus_type != DMU_OT_ZNODE) {
|
|
|
|
(void) fprintf(stderr, "invalid bonus type %d for obj %llu\n",
|
|
|
|
doi.doi_bonus_type, (u_longlong_t)child_obj);
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dump_opt['v'] > 6) {
|
|
|
|
(void) printf("obj=%llu %s type=%d bonustype=%d\n",
|
|
|
|
(u_longlong_t)child_obj, curpath, doi.doi_type,
|
|
|
|
doi.doi_bonus_type);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) strlcat(curpath, "/", sizeof (curpath));
|
|
|
|
|
|
|
|
switch (doi.doi_type) {
|
|
|
|
case DMU_OT_DIRECTORY_CONTENTS:
|
|
|
|
if (s != NULL && *(s + 1) != '\0')
|
|
|
|
return (dump_path_impl(os, child_obj, s + 1));
|
|
|
|
/*FALLTHROUGH*/
|
|
|
|
case DMU_OT_PLAIN_FILE_CONTENTS:
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, child_obj, dump_opt['v'], &header, NULL);
|
2017-04-13 19:40:56 +03:00
|
|
|
return (0);
|
|
|
|
default:
|
|
|
|
(void) fprintf(stderr, "object %llu has non-file/directory "
|
|
|
|
"type %d\n", (u_longlong_t)obj, doi.doi_type);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Dump the blocks for the object specified by path inside the dataset.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
dump_path(char *ds, char *path)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
objset_t *os;
|
|
|
|
uint64_t root_obj;
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
err = open_objset(ds, FTAG, &os);
|
2017-04-13 19:40:56 +03:00
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
err = zap_lookup(os, MASTER_NODE_OBJ, ZFS_ROOT_OBJ, 8, 1, &root_obj);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr, "can't lookup root znode: %s\n",
|
|
|
|
strerror(err));
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
close_objset(os, FTAG);
|
2017-04-13 19:40:56 +03:00
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) snprintf(curpath, sizeof (curpath), "dataset=%s path=/", ds);
|
|
|
|
|
|
|
|
err = dump_path_impl(os, root_obj, path);
|
|
|
|
|
|
|
|
close_objset(os, FTAG);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
static int
|
|
|
|
dump_label(const char *dev)
|
|
|
|
{
|
2017-02-04 01:18:28 +03:00
|
|
|
char path[MAXPATHLEN];
|
2019-02-13 22:28:36 +03:00
|
|
|
zdb_label_t labels[VDEV_LABELS];
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t psize, ashift;
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
struct stat64 statbuf;
|
|
|
|
boolean_t config_found = B_FALSE;
|
|
|
|
boolean_t error = B_FALSE;
|
|
|
|
avl_tree_t config_tree;
|
|
|
|
avl_tree_t uberblock_tree;
|
|
|
|
void *node, *cookie;
|
|
|
|
int fd;
|
|
|
|
|
|
|
|
bzero(labels, sizeof (labels));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-01-13 20:25:15 +03:00
|
|
|
/*
|
|
|
|
* Check if we were given absolute path and use it as is.
|
|
|
|
* Otherwise if the provided vdev name doesn't point to a file,
|
|
|
|
* try prepending expected disk paths and partition numbers.
|
|
|
|
*/
|
2017-02-04 01:18:28 +03:00
|
|
|
(void) strlcpy(path, dev, sizeof (path));
|
2017-01-13 20:25:15 +03:00
|
|
|
if (dev[0] != '/' && stat64(path, &statbuf) != 0) {
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = zfs_resolve_shortname(dev, path, MAXPATHLEN);
|
|
|
|
if (error == 0 && zfs_dev_is_whole_disk(path)) {
|
|
|
|
if (zfs_append_partition(path, MAXPATHLEN) == -1)
|
|
|
|
error = ENOENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (error || (stat64(path, &statbuf) != 0)) {
|
|
|
|
(void) printf("failed to find device %s, try "
|
|
|
|
"specifying absolute path instead\n", dev);
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if ((fd = open64(path, O_RDONLY)) < 0) {
|
|
|
|
(void) printf("cannot open '%s': %s\n", path, strerror(errno));
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
2010-12-14 20:50:37 +03:00
|
|
|
if (fstat64_blk(fd, &statbuf) != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("failed to stat '%s': %s\n", path,
|
2008-11-20 23:01:55 +03:00
|
|
|
strerror(errno));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) close(fd);
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
2017-12-19 21:49:33 +03:00
|
|
|
if (S_ISBLK(statbuf.st_mode) && ioctl(fd, BLKFLSBUF) != 0)
|
|
|
|
(void) printf("failed to invalidate cache '%s' : %s\n", path,
|
|
|
|
strerror(errno));
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
avl_create(&config_tree, cksum_record_compare,
|
|
|
|
sizeof (cksum_record_t), offsetof(cksum_record_t, link));
|
|
|
|
avl_create(&uberblock_tree, cksum_record_compare,
|
|
|
|
sizeof (cksum_record_t), offsetof(cksum_record_t, link));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
psize = statbuf.st_size;
|
|
|
|
psize = P2ALIGN(psize, (uint64_t)sizeof (vdev_label_t));
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
ashift = SPA_MINBLOCKSHIFT;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
/*
|
|
|
|
* 1. Read the label from disk
|
|
|
|
* 2. Unpack the configuration and insert in config tree.
|
|
|
|
* 3. Traverse all uberblocks and insert in uberblock tree.
|
|
|
|
*/
|
|
|
|
for (int l = 0; l < VDEV_LABELS; l++) {
|
2019-02-13 22:28:36 +03:00
|
|
|
zdb_label_t *label = &labels[l];
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
char *buf = label->label.vl_vdev_phys.vp_nvlist;
|
|
|
|
size_t buflen = sizeof (label->label.vl_vdev_phys.vp_nvlist);
|
|
|
|
nvlist_t *config;
|
|
|
|
cksum_record_t *rec;
|
|
|
|
zio_cksum_t cksum;
|
|
|
|
vdev_t vd;
|
|
|
|
|
|
|
|
if (pread64(fd, &label->label, sizeof (label->label),
|
|
|
|
vdev_label_offset(psize, l, 0)) != sizeof (label->label)) {
|
2017-02-04 01:18:28 +03:00
|
|
|
if (!dump_opt['q'])
|
|
|
|
(void) printf("failed to read label %d\n", l);
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
label->read_failed = B_TRUE;
|
|
|
|
error = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
label->read_failed = B_FALSE;
|
|
|
|
|
|
|
|
if (nvlist_unpack(buf, buflen, &config, 0) == 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *vdev_tree = NULL;
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
size_t size;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if ((nvlist_lookup_nvlist(config,
|
|
|
|
ZPOOL_CONFIG_VDEV_TREE, &vdev_tree) != 0) ||
|
|
|
|
(nvlist_lookup_uint64(vdev_tree,
|
|
|
|
ZPOOL_CONFIG_ASHIFT, &ashift) != 0))
|
|
|
|
ashift = SPA_MINBLOCKSHIFT;
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
|
|
|
|
if (nvlist_size(config, &size, NV_ENCODE_XDR) != 0)
|
|
|
|
size = buflen;
|
|
|
|
|
|
|
|
fletcher_4_native_varsize(buf, size, &cksum);
|
|
|
|
rec = cksum_record_insert(&config_tree, &cksum, l);
|
|
|
|
|
|
|
|
label->config = rec;
|
|
|
|
label->config_nv = config;
|
|
|
|
config_found = B_TRUE;
|
|
|
|
} else {
|
|
|
|
error = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
|
|
|
|
vd.vdev_ashift = ashift;
|
|
|
|
vd.vdev_top = &vd;
|
|
|
|
|
|
|
|
for (int i = 0; i < VDEV_UBERBLOCK_COUNT(&vd); i++) {
|
|
|
|
uint64_t uoff = VDEV_UBERBLOCK_OFFSET(&vd, i);
|
|
|
|
uberblock_t *ub = (void *)((char *)label + uoff);
|
|
|
|
|
|
|
|
if (uberblock_verify(ub))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
fletcher_4_native_varsize(ub, sizeof (*ub), &cksum);
|
|
|
|
rec = cksum_record_insert(&uberblock_tree, &cksum, l);
|
|
|
|
|
|
|
|
label->uberblocks[i] = rec;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Dump the label and uberblocks.
|
|
|
|
*/
|
|
|
|
for (int l = 0; l < VDEV_LABELS; l++) {
|
2019-02-13 22:28:36 +03:00
|
|
|
zdb_label_t *label = &labels[l];
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
size_t buflen = sizeof (label->label.vl_vdev_phys.vp_nvlist);
|
|
|
|
|
|
|
|
if (label->read_failed == B_TRUE)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (label->config_nv) {
|
|
|
|
dump_config_from_label(label, buflen, l);
|
|
|
|
} else {
|
|
|
|
if (!dump_opt['q'])
|
|
|
|
(void) printf("failed to unpack label %d\n", l);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['u'])
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
dump_label_uberblocks(label, ashift, l);
|
|
|
|
|
|
|
|
nvlist_free(label->config_nv);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
cookie = NULL;
|
|
|
|
while ((node = avl_destroy_nodes(&config_tree, &cookie)) != NULL)
|
|
|
|
umem_free(node, sizeof (cksum_record_t));
|
|
|
|
|
|
|
|
cookie = NULL;
|
|
|
|
while ((node = avl_destroy_nodes(&uberblock_tree, &cookie)) != NULL)
|
|
|
|
umem_free(node, sizeof (cksum_record_t));
|
|
|
|
|
|
|
|
avl_destroy(&config_tree);
|
|
|
|
avl_destroy(&uberblock_tree);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) close(fd);
|
2017-02-04 01:18:28 +03:00
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
return (config_found == B_FALSE ? 2 :
|
|
|
|
(error == B_TRUE ? 1 : 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-07-24 19:53:55 +03:00
|
|
|
static uint64_t dataset_feature_count[SPA_FEATURES];
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
static uint64_t global_feature_count[SPA_FEATURES];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static uint64_t remap_deadlist_count = 0;
|
2014-11-03 23:15:08 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static int
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_one_objset(const char *dsname, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
objset_t *os;
|
2015-07-24 19:53:55 +03:00
|
|
|
spa_feature_t f;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
error = open_objset(dsname, FTAG, &os);
|
2017-04-13 19:40:56 +03:00
|
|
|
if (error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
2015-07-24 19:53:55 +03:00
|
|
|
|
|
|
|
for (f = 0; f < SPA_FEATURES; f++) {
|
2018-10-16 21:15:04 +03:00
|
|
|
if (!dsl_dataset_feature_is_active(dmu_objset_ds(os), f))
|
2015-07-24 19:53:55 +03:00
|
|
|
continue;
|
|
|
|
ASSERT(spa_feature_table[f].fi_flags &
|
|
|
|
ZFEATURE_FLAG_PER_DATASET);
|
|
|
|
dataset_feature_count[f]++;
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (dsl_dataset_remap_deadlist_exists(dmu_objset_ds(os))) {
|
|
|
|
remap_deadlist_count++;
|
|
|
|
}
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
for (dsl_bookmark_node_t *dbn =
|
|
|
|
avl_first(&dmu_objset_ds(os)->ds_bookmarks); dbn != NULL;
|
|
|
|
dbn = AVL_NEXT(&dmu_objset_ds(os)->ds_bookmarks, dbn)) {
|
|
|
|
mos_obj_refd(dbn->dbn_phys.zbm_redaction_obj);
|
|
|
|
if (dbn->dbn_phys.zbm_redaction_obj != 0)
|
|
|
|
global_feature_count[SPA_FEATURE_REDACTION_BOOKMARKS]++;
|
|
|
|
if (dbn->dbn_phys.zbm_flags & ZBM_FLAG_HAS_FBN)
|
|
|
|
global_feature_count[SPA_FEATURE_BOOKMARK_WRITTEN]++;
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
if (dsl_deadlist_is_open(&dmu_objset_ds(os)->ds_dir->dd_livelist) &&
|
|
|
|
!dmu_objset_is_snapshot(os)) {
|
|
|
|
global_feature_count[SPA_FEATURE_LIVELIST]++;
|
|
|
|
}
|
|
|
|
|
|
|
|
dump_objset(os);
|
2017-04-13 19:40:56 +03:00
|
|
|
close_objset(os, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
fuid_table_destroy();
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Block statistics.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-11-03 23:15:08 +03:00
|
|
|
#define PSIZE_HISTO_SIZE (SPA_OLD_MAXBLOCKSIZE / SPA_MINBLOCKSIZE + 2)
|
2008-11-20 23:01:55 +03:00
|
|
|
typedef struct zdb_blkstats {
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zb_asize;
|
|
|
|
uint64_t zb_lsize;
|
|
|
|
uint64_t zb_psize;
|
|
|
|
uint64_t zb_count;
|
2014-11-03 22:12:40 +03:00
|
|
|
uint64_t zb_gangs;
|
|
|
|
uint64_t zb_ditto_samevdev;
|
2018-09-06 04:33:36 +03:00
|
|
|
uint64_t zb_ditto_same_ms;
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zb_psize_histogram[PSIZE_HISTO_SIZE];
|
2008-11-20 23:01:55 +03:00
|
|
|
} zdb_blkstats_t;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Extended object types to report deferred frees and dedup auto-ditto blocks.
|
|
|
|
*/
|
|
|
|
#define ZDB_OT_DEFERRED (DMU_OT_NUMTYPES + 0)
|
|
|
|
#define ZDB_OT_DITTO (DMU_OT_NUMTYPES + 1)
|
2012-12-14 03:24:15 +04:00
|
|
|
#define ZDB_OT_OTHER (DMU_OT_NUMTYPES + 2)
|
|
|
|
#define ZDB_OT_TOTAL (DMU_OT_NUMTYPES + 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char *zdb_ot_extname[] = {
|
2010-05-29 00:45:14 +04:00
|
|
|
"deferred free",
|
|
|
|
"dedup ditto",
|
2012-12-14 03:24:15 +04:00
|
|
|
"other",
|
2010-05-29 00:45:14 +04:00
|
|
|
"Total",
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
#define ZB_TOTAL DN_MAX_LEVELS
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
typedef struct zdb_cb {
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_blkstats_t zcb_type[ZB_TOTAL + 1][ZDB_OT_TOTAL + 1];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
uint64_t zcb_removing_size;
|
2016-12-17 01:11:29 +03:00
|
|
|
uint64_t zcb_checkpoint_size;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zcb_dedup_asize;
|
|
|
|
uint64_t zcb_dedup_blocks;
|
2014-06-06 01:19:08 +04:00
|
|
|
uint64_t zcb_embedded_blocks[NUM_BP_EMBEDDED_TYPES];
|
|
|
|
uint64_t zcb_embedded_histogram[NUM_BP_EMBEDDED_TYPES]
|
2016-08-04 17:23:35 +03:00
|
|
|
[BPE_PAYLOAD_SIZE + 1];
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zcb_start;
|
2017-10-27 22:46:35 +03:00
|
|
|
hrtime_t zcb_lastprint;
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zcb_totalasize;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zcb_errors[256];
|
|
|
|
int zcb_readfails;
|
|
|
|
int zcb_haderrors;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *zcb_spa;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
uint32_t **zcb_vd_obsolete_counts;
|
2008-11-20 23:01:55 +03:00
|
|
|
} zdb_cb_t;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
/* test if two DVA offsets from same vdev are within the same metaslab */
|
|
|
|
static boolean_t
|
|
|
|
same_metaslab(spa_t *spa, uint64_t vdev, uint64_t off1, uint64_t off2)
|
|
|
|
{
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, vdev);
|
|
|
|
uint64_t ms_shift = vd->vdev_ms_shift;
|
|
|
|
|
|
|
|
return ((off1 >> ms_shift) == (off2 >> ms_shift));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_count_block(zdb_cb_t *zcb, zilog_t *zilog, const blkptr_t *bp,
|
|
|
|
dmu_object_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t refcnt = 0;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ASSERT(type < ZDB_OT_TOTAL);
|
|
|
|
|
|
|
|
if (zilog && zil_bp_tree_add(zilog, bp) != 0)
|
|
|
|
return;
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
spa_config_enter(zcb->zcb_spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < 4; i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
int l = (i < 2) ? BP_GET_LEVEL(bp) : ZB_TOTAL;
|
2010-05-29 00:45:14 +04:00
|
|
|
int t = (i & 1) ? type : ZDB_OT_TOTAL;
|
2014-11-03 22:12:40 +03:00
|
|
|
int equal;
|
2008-11-20 23:01:55 +03:00
|
|
|
zdb_blkstats_t *zb = &zcb->zcb_type[l][t];
|
|
|
|
|
|
|
|
zb->zb_asize += BP_GET_ASIZE(bp);
|
|
|
|
zb->zb_lsize += BP_GET_LSIZE(bp);
|
|
|
|
zb->zb_psize += BP_GET_PSIZE(bp);
|
|
|
|
zb->zb_count++;
|
2014-11-03 23:15:08 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The histogram is only big enough to record blocks up to
|
|
|
|
* SPA_OLD_MAXBLOCKSIZE; larger blocks go into the last,
|
|
|
|
* "other", bucket.
|
|
|
|
*/
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned idx = BP_GET_PSIZE(bp) >> SPA_MINBLOCKSHIFT;
|
2014-11-03 23:15:08 +03:00
|
|
|
idx = MIN(idx, SPA_OLD_MAXBLOCKSIZE / SPA_MINBLOCKSIZE + 1);
|
|
|
|
zb->zb_psize_histogram[idx]++;
|
2014-11-03 22:12:40 +03:00
|
|
|
|
|
|
|
zb->zb_gangs += BP_COUNT_GANG(bp);
|
|
|
|
|
|
|
|
switch (BP_GET_NDVAS(bp)) {
|
|
|
|
case 2:
|
|
|
|
if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
2018-09-06 04:33:36 +03:00
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1])) {
|
2014-11-03 22:12:40 +03:00
|
|
|
zb->zb_ditto_samevdev++;
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (same_metaslab(zcb->zcb_spa,
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[1])))
|
|
|
|
zb->zb_ditto_same_ms++;
|
|
|
|
}
|
2014-11-03 22:12:40 +03:00
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1])) +
|
|
|
|
(DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2])) +
|
|
|
|
(DVA_GET_VDEV(&bp->blk_dva[1]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2]));
|
2018-09-06 04:33:36 +03:00
|
|
|
if (equal != 0) {
|
2014-11-03 22:12:40 +03:00
|
|
|
zb->zb_ditto_samevdev++;
|
2018-09-06 04:33:36 +03:00
|
|
|
|
|
|
|
if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1]) &&
|
|
|
|
same_metaslab(zcb->zcb_spa,
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[1])))
|
|
|
|
zb->zb_ditto_same_ms++;
|
|
|
|
else if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2]) &&
|
|
|
|
same_metaslab(zcb->zcb_spa,
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[0]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[2])))
|
|
|
|
zb->zb_ditto_same_ms++;
|
|
|
|
else if (DVA_GET_VDEV(&bp->blk_dva[1]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2]) &&
|
|
|
|
same_metaslab(zcb->zcb_spa,
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[1]),
|
|
|
|
DVA_GET_OFFSET(&bp->blk_dva[2])))
|
|
|
|
zb->zb_ditto_same_ms++;
|
|
|
|
}
|
2014-11-03 22:12:40 +03:00
|
|
|
break;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
spa_config_exit(zcb->zcb_spa, SCL_CONFIG, FTAG);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_EMBEDDED(bp)) {
|
|
|
|
zcb->zcb_embedded_blocks[BPE_GET_ETYPE(bp)]++;
|
|
|
|
zcb->zcb_embedded_histogram[BPE_GET_ETYPE(bp)]
|
|
|
|
[BPE_GET_PSIZE(bp)]++;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['L'])
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (BP_GET_DEDUP(bp)) {
|
|
|
|
ddt_t *ddt;
|
|
|
|
ddt_entry_t *dde;
|
|
|
|
|
|
|
|
ddt = ddt_select(zcb->zcb_spa, bp);
|
|
|
|
ddt_enter(ddt);
|
|
|
|
dde = ddt_lookup(ddt, bp, B_FALSE);
|
|
|
|
|
|
|
|
if (dde == NULL) {
|
|
|
|
refcnt = 0;
|
|
|
|
} else {
|
|
|
|
ddt_phys_t *ddp = ddt_phys_select(dde, bp);
|
|
|
|
ddt_phys_decref(ddp);
|
|
|
|
refcnt = ddp->ddp_refcnt;
|
|
|
|
if (ddt_phys_total_refcnt(dde) == 0)
|
|
|
|
ddt_remove(ddt, dde);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_exit(ddt);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(zio_wait(zio_claim(NULL, zcb->zcb_spa,
|
2016-12-17 01:11:29 +03:00
|
|
|
refcnt ? 0 : spa_min_claim_txg(zcb->zcb_spa),
|
2010-05-29 00:45:14 +04:00
|
|
|
bp, NULL, NULL, ZIO_FLAG_CANFAIL)), ==, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
static void
|
|
|
|
zdb_blkptr_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
spa_t *spa = zio->io_spa;
|
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
int ioerr = zio->io_error;
|
|
|
|
zdb_cb_t *zcb = zio->io_private;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t *zb = &zio->io_bookmark;
|
2013-05-03 03:36:32 +04:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_free(zio->io_abd);
|
2013-05-03 03:36:32 +04:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
spa->spa_load_verify_bytes -= BP_GET_PSIZE(bp);
|
2013-05-03 03:36:32 +04:00
|
|
|
cv_broadcast(&spa->spa_scrub_io_cv);
|
|
|
|
|
|
|
|
if (ioerr && !(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
zcb->zcb_haderrors = 1;
|
|
|
|
zcb->zcb_errors[ioerr]++;
|
|
|
|
|
|
|
|
if (dump_opt['b'] >= 2)
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2013-05-03 03:36:32 +04:00
|
|
|
else
|
|
|
|
blkbuf[0] = '\0';
|
|
|
|
|
|
|
|
(void) printf("zdb_blkptr_cb: "
|
|
|
|
"Got error %d reading "
|
|
|
|
"<%llu, %llu, %lld, %llx> %s -- skipping\n",
|
|
|
|
ioerr,
|
|
|
|
(u_longlong_t)zb->zb_objset,
|
|
|
|
(u_longlong_t)zb->zb_object,
|
|
|
|
(u_longlong_t)zb->zb_level,
|
|
|
|
(u_longlong_t)zb->zb_blkid,
|
|
|
|
blkbuf);
|
|
|
|
}
|
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
2013-07-03 00:26:24 +04:00
|
|
|
zdb_blkptr_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
|
2014-06-25 22:37:59 +04:00
|
|
|
const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
zdb_cb_t *zcb = arg;
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_object_type_t type;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t is_metadata;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (zb->zb_level == ZB_DNODE_LEVEL)
|
2015-12-22 04:31:57 +03:00
|
|
|
return (0);
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (dump_opt['b'] >= 5 && bp->blk_birth > 0) {
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
|
|
|
(void) printf("objset %llu object %llu "
|
|
|
|
"level %lld offset 0x%llx %s\n",
|
|
|
|
(u_longlong_t)zb->zb_objset,
|
|
|
|
(u_longlong_t)zb->zb_object,
|
|
|
|
(longlong_t)zb->zb_level,
|
|
|
|
(u_longlong_t)blkid2offset(dnp, bp, zb),
|
|
|
|
blkbuf);
|
|
|
|
}
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (BP_IS_HOLE(bp) || BP_IS_REDACTED(bp))
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
type = BP_GET_TYPE(bp);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
zdb_count_block(zcb, zilog, bp,
|
|
|
|
(type & DMU_OT_NEWTYPE) ? ZDB_OT_OTHER : type);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
is_metadata = (BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp) &&
|
|
|
|
(dump_opt['c'] > 1 || (dump_opt['c'] && is_metadata))) {
|
2010-05-29 00:45:14 +04:00
|
|
|
size_t size = BP_GET_PSIZE(bp);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *abd = abd_alloc(size, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
int flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SCRUB | ZIO_FLAG_RAW;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* If it's an intent log block, failure is expected. */
|
|
|
|
if (zb->zb_level == ZB_ZIL_LEVEL)
|
|
|
|
flags |= ZIO_FLAG_SPECULATIVE;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
while (spa->spa_load_verify_bytes > max_inflight_bytes)
|
2013-05-03 03:36:32 +04:00
|
|
|
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
|
2019-08-13 17:11:57 +03:00
|
|
|
spa->spa_load_verify_bytes += size;
|
2013-05-03 03:36:32 +04:00
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_read(NULL, spa, bp, abd, size,
|
2013-05-03 03:36:32 +04:00
|
|
|
zdb_blkptr_done, zcb, ZIO_PRIORITY_ASYNC_READ, flags, zb));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
zcb->zcb_readfails = 0;
|
|
|
|
|
2015-05-15 02:41:29 +03:00
|
|
|
/* only call gethrtime() every 100 blocks */
|
|
|
|
static int iters;
|
|
|
|
if (++iters > 100)
|
|
|
|
iters = 0;
|
|
|
|
else
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (dump_opt['b'] < 5 && gethrtime() > zcb->zcb_lastprint + NANOSEC) {
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t now = gethrtime();
|
|
|
|
char buf[10];
|
|
|
|
uint64_t bytes = zcb->zcb_type[ZB_TOTAL][ZDB_OT_TOTAL].zb_asize;
|
|
|
|
int kb_per_sec =
|
|
|
|
1 + bytes / (1 + ((now - zcb->zcb_start) / 1000 / 1000));
|
|
|
|
int sec_remaining =
|
|
|
|
(zcb->zcb_totalasize - bytes) / 1024 / kb_per_sec;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (buf) >= NN_NUMBUF_SZ);
|
|
|
|
|
2017-05-02 23:43:53 +03:00
|
|
|
zfs_nicebytes(bytes, buf, sizeof (buf));
|
2013-03-25 01:24:51 +04:00
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\r%5s completed (%4dMB/s) "
|
|
|
|
"estimated time remaining: %uhr %02umin %02usec ",
|
|
|
|
buf, kb_per_sec / 1024,
|
|
|
|
sec_remaining / 60 / 60,
|
|
|
|
sec_remaining / 60 % 60,
|
|
|
|
sec_remaining % 60);
|
|
|
|
|
|
|
|
zcb->zcb_lastprint = now;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
zdb_leak(void *arg, uint64_t start, uint64_t size)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
vdev_t *vd = arg;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) printf("leaked space: vdev %llu, offset 0x%llx, size %llu\n",
|
|
|
|
(u_longlong_t)vd->vdev_id, (u_longlong_t)start, (u_longlong_t)size);
|
|
|
|
}
|
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
static metaslab_ops_t zdb_metaslab_ops = {
|
2014-07-20 00:19:24 +04:00
|
|
|
NULL /* alloc */
|
2010-05-29 00:45:14 +04:00
|
|
|
};
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
typedef int (*zdb_log_sm_cb_t)(spa_t *spa, space_map_entry_t *sme,
|
|
|
|
uint64_t txg, void *arg);
|
|
|
|
|
|
|
|
typedef struct unflushed_iter_cb_arg {
|
|
|
|
spa_t *uic_spa;
|
|
|
|
uint64_t uic_txg;
|
|
|
|
void *uic_arg;
|
|
|
|
zdb_log_sm_cb_t uic_cb;
|
|
|
|
} unflushed_iter_cb_arg_t;
|
|
|
|
|
|
|
|
static int
|
|
|
|
iterate_through_spacemap_logs_cb(space_map_entry_t *sme, void *arg)
|
|
|
|
{
|
|
|
|
unflushed_iter_cb_arg_t *uic = arg;
|
|
|
|
return (uic->uic_cb(uic->uic_spa, sme, uic->uic_txg, uic->uic_arg));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
iterate_through_spacemap_logs(spa_t *spa, zdb_log_sm_cb_t cb, void *arg)
|
|
|
|
{
|
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
|
|
|
|
return;
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
|
|
|
|
sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls)) {
|
|
|
|
space_map_t *sm = NULL;
|
|
|
|
VERIFY0(space_map_open(&sm, spa_meta_objset(spa),
|
|
|
|
sls->sls_sm_obj, 0, UINT64_MAX, SPA_MINBLOCKSHIFT));
|
|
|
|
|
|
|
|
unflushed_iter_cb_arg_t uic = {
|
|
|
|
.uic_spa = spa,
|
|
|
|
.uic_txg = sls->sls_txg,
|
|
|
|
.uic_arg = arg,
|
|
|
|
.uic_cb = cb
|
|
|
|
};
|
|
|
|
|
|
|
|
VERIFY0(space_map_iterate(sm, space_map_length(sm),
|
|
|
|
iterate_through_spacemap_logs_cb, &uic));
|
|
|
|
space_map_close(sm);
|
|
|
|
}
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
load_unflushed_svr_segs_cb(spa_t *spa, space_map_entry_t *sme,
|
|
|
|
uint64_t txg, void *arg)
|
|
|
|
{
|
|
|
|
spa_vdev_removal_t *svr = arg;
|
|
|
|
|
|
|
|
uint64_t offset = sme->sme_offset;
|
|
|
|
uint64_t size = sme->sme_run;
|
|
|
|
|
|
|
|
/* skip vdevs we don't care about */
|
|
|
|
if (sme->sme_vdev != svr->svr_vdev_id)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, sme->sme_vdev);
|
|
|
|
metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
|
|
|
|
ASSERT(sme->sme_type == SM_ALLOC || sme->sme_type == SM_FREE);
|
|
|
|
|
|
|
|
if (txg < metaslab_unflushed_txg(ms))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
ASSERT(vim != NULL);
|
|
|
|
if (offset >= vdev_indirect_mapping_max_offset(vim))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (sme->sme_type == SM_ALLOC)
|
|
|
|
range_tree_add(svr->svr_allocd_segs, offset, size);
|
|
|
|
else
|
|
|
|
range_tree_remove(svr->svr_allocd_segs, offset, size);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
claim_segment_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
|
|
|
|
uint64_t size, void *arg)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This callback was called through a remap from
|
|
|
|
* a device being removed. Therefore, the vdev that
|
|
|
|
* this callback is applied to is a concrete
|
|
|
|
* vdev.
|
|
|
|
*/
|
|
|
|
ASSERT(vdev_is_concrete(vd));
|
|
|
|
|
|
|
|
VERIFY0(metaslab_claim_impl(vd, offset, size,
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_min_claim_txg(vd->vdev_spa)));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
claim_segment_cb(void *arg, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
vdev_t *vd = arg;
|
|
|
|
|
|
|
|
vdev_indirect_ops.vdev_op_remap(vd, offset, size,
|
|
|
|
claim_segment_impl_cb, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After accounting for all allocated blocks that are directly referenced,
|
|
|
|
* we might have missed a reference to a block from a partially complete
|
|
|
|
* (and thus unused) indirect mapping object. We perform a secondary pass
|
|
|
|
* through the metaslabs we have already mapped and claim the destination
|
|
|
|
* blocks.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
zdb_claim_removing(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
2019-01-30 20:54:27 +03:00
|
|
|
if (dump_opt['L'])
|
|
|
|
return;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (spa->spa_vdev_removal == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
|
|
|
spa_vdev_removal_t *svr = spa->spa_vdev_removal;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
vdev_t *vd = vdev_lookup_top(spa, svr->svr_vdev_id);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ASSERT0(range_tree_space(svr->svr_allocd_segs));
|
|
|
|
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
range_tree_t *allocs = range_tree_create(NULL, RANGE_SEG64, NULL, 0, 0);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
for (uint64_t msi = 0; msi < vd->vdev_ms_count; msi++) {
|
|
|
|
metaslab_t *msp = vd->vdev_ms[msi];
|
|
|
|
|
|
|
|
if (msp->ms_start >= vdev_indirect_mapping_max_offset(vim))
|
|
|
|
break;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
ASSERT0(range_tree_space(allocs));
|
|
|
|
if (msp->ms_sm != NULL)
|
|
|
|
VERIFY0(space_map_load(msp->ms_sm, allocs, SM_ALLOC));
|
|
|
|
range_tree_vacate(allocs, range_tree_add, svr->svr_allocd_segs);
|
|
|
|
}
|
|
|
|
range_tree_destroy(allocs);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
iterate_through_spacemap_logs(spa, load_unflushed_svr_segs_cb, svr);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
/*
|
|
|
|
* Clear everything past what has been synced,
|
|
|
|
* because we have not allocated mappings for
|
|
|
|
* it yet.
|
|
|
|
*/
|
|
|
|
range_tree_clear(svr->svr_allocd_segs,
|
|
|
|
vdev_indirect_mapping_max_offset(vim),
|
|
|
|
vd->vdev_asize - vdev_indirect_mapping_max_offset(vim));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
zcb->zcb_removing_size += range_tree_space(svr->svr_allocd_segs);
|
|
|
|
range_tree_vacate(svr->svr_allocd_segs, claim_segment_cb, vd);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
2019-07-26 20:54:14 +03:00
|
|
|
increment_indirect_mapping_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed,
|
|
|
|
dmu_tx_t *tx)
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
{
|
|
|
|
zdb_cb_t *zcb = arg;
|
|
|
|
spa_t *spa = zcb->zcb_spa;
|
|
|
|
vdev_t *vd;
|
|
|
|
const dva_t *dva = &bp->blk_dva[0];
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
ASSERT(!bp_freed);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ASSERT(!dump_opt['L']);
|
|
|
|
ASSERT3U(BP_GET_NDVAS(bp), ==, 1);
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
vd = vdev_lookup_top(zcb->zcb_spa, DVA_GET_VDEV(dva));
|
|
|
|
ASSERT3P(vd, !=, NULL);
|
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
|
|
|
|
|
|
|
ASSERT(vd->vdev_indirect_config.vic_mapping_object != 0);
|
|
|
|
ASSERT3P(zcb->zcb_vd_obsolete_counts[vd->vdev_id], !=, NULL);
|
|
|
|
|
|
|
|
vdev_indirect_mapping_increment_obsolete_count(
|
|
|
|
vd->vdev_indirect_mapping,
|
|
|
|
DVA_GET_OFFSET(dva), DVA_GET_ASIZE(dva),
|
|
|
|
zcb->zcb_vd_obsolete_counts[vd->vdev_id]);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint32_t *
|
|
|
|
zdb_load_obsolete_counts(vdev_t *vd)
|
|
|
|
{
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
spa_condensing_indirect_phys_t *scip =
|
|
|
|
&spa->spa_condensing_indirect_phys;
|
2018-10-10 01:42:42 +03:00
|
|
|
uint64_t obsolete_sm_object;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
uint32_t *counts;
|
|
|
|
|
2018-10-10 01:42:42 +03:00
|
|
|
VERIFY0(vdev_obsolete_sm_object(vd, &obsolete_sm_object));
|
|
|
|
EQUIV(obsolete_sm_object != 0, vd->vdev_obsolete_sm != NULL);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
counts = vdev_indirect_mapping_load_obsolete_counts(vim);
|
|
|
|
if (vd->vdev_obsolete_sm != NULL) {
|
|
|
|
vdev_indirect_mapping_load_obsolete_spacemap(vim, counts,
|
|
|
|
vd->vdev_obsolete_sm);
|
|
|
|
}
|
|
|
|
if (scip->scip_vdev == vd->vdev_id &&
|
|
|
|
scip->scip_prev_obsolete_sm_object != 0) {
|
|
|
|
space_map_t *prev_obsolete_sm = NULL;
|
|
|
|
VERIFY0(space_map_open(&prev_obsolete_sm, spa->spa_meta_objset,
|
|
|
|
scip->scip_prev_obsolete_sm_object, 0, vd->vdev_asize, 0));
|
|
|
|
vdev_indirect_mapping_load_obsolete_spacemap(vim, counts,
|
|
|
|
prev_obsolete_sm);
|
|
|
|
space_map_close(prev_obsolete_sm);
|
|
|
|
}
|
|
|
|
return (counts);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
zdb_ddt_leak_init(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
ddt_bookmark_t ddb;
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_entry_t dde;
|
|
|
|
int error;
|
2010-08-26 20:52:39 +04:00
|
|
|
int p;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
ASSERT(!dump_opt['L']);
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&ddb, sizeof (ddb));
|
2010-05-29 00:45:14 +04:00
|
|
|
while ((error = ddt_walk(spa, &ddb, &dde)) == 0) {
|
|
|
|
blkptr_t blk;
|
|
|
|
ddt_phys_t *ddp = dde.dde_phys;
|
|
|
|
|
|
|
|
if (ddb.ddb_class == DDT_CLASS_UNIQUE)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(ddt_phys_total_refcnt(&dde) > 1);
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ddp->ddp_phys_birth == 0)
|
|
|
|
continue;
|
|
|
|
ddt_bp_create(ddb.ddb_checksum,
|
|
|
|
&dde.dde_key, ddp, &blk);
|
|
|
|
if (p == DDT_PHYS_DITTO) {
|
|
|
|
zdb_count_block(zcb, NULL, &blk, ZDB_OT_DITTO);
|
|
|
|
} else {
|
|
|
|
zcb->zcb_dedup_asize +=
|
|
|
|
BP_GET_ASIZE(&blk) * (ddp->ddp_refcnt - 1);
|
|
|
|
zcb->zcb_dedup_blocks++;
|
|
|
|
}
|
|
|
|
}
|
2019-01-30 20:54:27 +03:00
|
|
|
ddt_t *ddt = spa->spa_ddt[ddb.ddb_checksum];
|
|
|
|
ddt_enter(ddt);
|
|
|
|
VERIFY(ddt_lookup(ddt, &blk, B_TRUE) != NULL);
|
|
|
|
ddt_exit(ddt);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(error == ENOENT);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
typedef struct checkpoint_sm_exclude_entry_arg {
|
|
|
|
vdev_t *cseea_vd;
|
|
|
|
uint64_t cseea_checkpoint_size;
|
|
|
|
} checkpoint_sm_exclude_entry_arg_t;
|
|
|
|
|
|
|
|
static int
|
2017-08-04 19:30:49 +03:00
|
|
|
checkpoint_sm_exclude_entry_cb(space_map_entry_t *sme, void *arg)
|
2016-12-17 01:11:29 +03:00
|
|
|
{
|
|
|
|
checkpoint_sm_exclude_entry_arg_t *cseea = arg;
|
|
|
|
vdev_t *vd = cseea->cseea_vd;
|
2017-08-04 19:30:49 +03:00
|
|
|
metaslab_t *ms = vd->vdev_ms[sme->sme_offset >> vd->vdev_ms_shift];
|
|
|
|
uint64_t end = sme->sme_offset + sme->sme_run;
|
2016-12-17 01:11:29 +03:00
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
ASSERT(sme->sme_type == SM_FREE);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since the vdev_checkpoint_sm exists in the vdev level
|
|
|
|
* and the ms_sm space maps exist in the metaslab level,
|
|
|
|
* an entry in the checkpoint space map could theoretically
|
|
|
|
* cross the boundaries of the metaslab that it belongs.
|
|
|
|
*
|
|
|
|
* In reality, because of the way that we populate and
|
|
|
|
* manipulate the checkpoint's space maps currently,
|
|
|
|
* there shouldn't be any entries that cross metaslabs.
|
|
|
|
* Hence the assertion below.
|
|
|
|
*
|
|
|
|
* That said, there is no fundamental requirement that
|
|
|
|
* the checkpoint's space map entries should not cross
|
|
|
|
* metaslab boundaries. So if needed we could add code
|
|
|
|
* that handles metaslab-crossing segments in the future.
|
|
|
|
*/
|
2017-08-04 19:30:49 +03:00
|
|
|
VERIFY3U(sme->sme_offset, >=, ms->ms_start);
|
2016-12-17 01:11:29 +03:00
|
|
|
VERIFY3U(end, <=, ms->ms_start + ms->ms_size);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* By removing the entry from the allocated segments we
|
|
|
|
* also verify that the entry is there to begin with.
|
|
|
|
*/
|
|
|
|
mutex_enter(&ms->ms_lock);
|
2017-08-04 19:30:49 +03:00
|
|
|
range_tree_remove(ms->ms_allocatable, sme->sme_offset, sme->sme_run);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ms->ms_lock);
|
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
cseea->cseea_checkpoint_size += sme->sme_run;
|
2016-12-17 01:11:29 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_leak_init_vdev_exclude_checkpoint(vdev_t *vd, zdb_cb_t *zcb)
|
|
|
|
{
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
space_map_t *checkpoint_sm = NULL;
|
|
|
|
uint64_t checkpoint_sm_obj;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there is no vdev_top_zap, we are in a pool whose
|
|
|
|
* version predates the pool checkpoint feature.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_top_zap == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there is no reference of the vdev_checkpoint_sm in
|
|
|
|
* the vdev_top_zap, then one of the following scenarios
|
|
|
|
* is true:
|
|
|
|
*
|
|
|
|
* 1] There is no checkpoint
|
|
|
|
* 2] There is a checkpoint, but no checkpointed blocks
|
|
|
|
* have been freed yet
|
|
|
|
* 3] The current vdev is indirect
|
|
|
|
*
|
|
|
|
* In these cases we return immediately.
|
|
|
|
*/
|
|
|
|
if (zap_contains(spa_meta_objset(spa), vd->vdev_top_zap,
|
|
|
|
VDEV_TOP_ZAP_POOL_CHECKPOINT_SM) != 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
VERIFY0(zap_lookup(spa_meta_objset(spa), vd->vdev_top_zap,
|
|
|
|
VDEV_TOP_ZAP_POOL_CHECKPOINT_SM, sizeof (uint64_t), 1,
|
|
|
|
&checkpoint_sm_obj));
|
|
|
|
|
|
|
|
checkpoint_sm_exclude_entry_arg_t cseea;
|
|
|
|
cseea.cseea_vd = vd;
|
|
|
|
cseea.cseea_checkpoint_size = 0;
|
|
|
|
|
|
|
|
VERIFY0(space_map_open(&checkpoint_sm, spa_meta_objset(spa),
|
|
|
|
checkpoint_sm_obj, 0, vd->vdev_asize, vd->vdev_ashift));
|
|
|
|
|
|
|
|
VERIFY0(space_map_iterate(checkpoint_sm,
|
2019-02-12 21:38:11 +03:00
|
|
|
space_map_length(checkpoint_sm),
|
2016-12-17 01:11:29 +03:00
|
|
|
checkpoint_sm_exclude_entry_cb, &cseea));
|
|
|
|
space_map_close(checkpoint_sm);
|
|
|
|
|
|
|
|
zcb->zcb_checkpoint_size += cseea.cseea_checkpoint_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_leak_init_exclude_checkpoint(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
2019-01-30 20:54:27 +03:00
|
|
|
ASSERT(!dump_opt['L']);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
for (uint64_t c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
ASSERT3U(c, ==, rvd->vdev_child[c]->vdev_id);
|
|
|
|
zdb_leak_init_vdev_exclude_checkpoint(rvd->vdev_child[c], zcb);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static int
|
|
|
|
count_unflushed_space_cb(spa_t *spa, space_map_entry_t *sme,
|
|
|
|
uint64_t txg, void *arg)
|
|
|
|
{
|
|
|
|
int64_t *ualloc_space = arg;
|
|
|
|
|
|
|
|
uint64_t offset = sme->sme_offset;
|
|
|
|
uint64_t vdev_id = sme->sme_vdev;
|
|
|
|
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, vdev_id);
|
|
|
|
if (!vdev_is_concrete(vd))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
|
|
|
|
ASSERT(sme->sme_type == SM_ALLOC || sme->sme_type == SM_FREE);
|
|
|
|
|
|
|
|
if (txg < metaslab_unflushed_txg(ms))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (sme->sme_type == SM_ALLOC)
|
|
|
|
*ualloc_space += sme->sme_run;
|
|
|
|
else
|
|
|
|
*ualloc_space -= sme->sme_run;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int64_t
|
|
|
|
get_unflushed_alloc_space(spa_t *spa)
|
|
|
|
{
|
|
|
|
if (dump_opt['L'])
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
int64_t ualloc_space = 0;
|
|
|
|
iterate_through_spacemap_logs(spa, count_unflushed_space_cb,
|
|
|
|
&ualloc_space);
|
|
|
|
return (ualloc_space);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
load_unflushed_cb(spa_t *spa, space_map_entry_t *sme, uint64_t txg, void *arg)
|
|
|
|
{
|
|
|
|
maptype_t *uic_maptype = arg;
|
|
|
|
|
|
|
|
uint64_t offset = sme->sme_offset;
|
|
|
|
uint64_t size = sme->sme_run;
|
|
|
|
uint64_t vdev_id = sme->sme_vdev;
|
|
|
|
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, vdev_id);
|
|
|
|
|
|
|
|
/* skip indirect vdevs */
|
|
|
|
if (!vdev_is_concrete(vd))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
|
|
|
|
|
|
|
|
ASSERT(sme->sme_type == SM_ALLOC || sme->sme_type == SM_FREE);
|
|
|
|
ASSERT(*uic_maptype == SM_ALLOC || *uic_maptype == SM_FREE);
|
|
|
|
|
|
|
|
if (txg < metaslab_unflushed_txg(ms))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (*uic_maptype == sme->sme_type)
|
|
|
|
range_tree_add(ms->ms_allocatable, offset, size);
|
|
|
|
else
|
|
|
|
range_tree_remove(ms->ms_allocatable, offset, size);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
load_unflushed_to_ms_allocatables(spa_t *spa, maptype_t maptype)
|
|
|
|
{
|
|
|
|
iterate_through_spacemap_logs(spa, load_unflushed_cb, &maptype);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
static void
|
|
|
|
load_concrete_ms_allocatable_trees(spa_t *spa, maptype_t maptype)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
for (uint64_t i = 0; i < rvd->vdev_children; i++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[i];
|
|
|
|
|
|
|
|
ASSERT3U(i, ==, vd->vdev_id);
|
|
|
|
|
|
|
|
if (vd->vdev_ops == &vdev_indirect_ops)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
|
|
|
metaslab_t *msp = vd->vdev_ms[m];
|
|
|
|
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\rloading concrete vdev %llu, "
|
|
|
|
"metaslab %llu of %llu ...",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)msp->ms_id,
|
|
|
|
(longlong_t)vd->vdev_ms_count);
|
|
|
|
|
|
|
|
mutex_enter(&msp->ms_lock);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
range_tree_vacate(msp->ms_allocatable, NULL, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to spend the CPU manipulating the
|
|
|
|
* size-ordered tree, so clear the range_tree ops.
|
|
|
|
*/
|
|
|
|
msp->ms_allocatable->rt_ops = NULL;
|
|
|
|
|
|
|
|
if (msp->ms_sm != NULL) {
|
|
|
|
VERIFY0(space_map_load(msp->ms_sm,
|
|
|
|
msp->ms_allocatable, maptype));
|
|
|
|
}
|
|
|
|
if (!msp->ms_loaded)
|
|
|
|
msp->ms_loaded = B_TRUE;
|
|
|
|
mutex_exit(&msp->ms_lock);
|
|
|
|
}
|
|
|
|
}
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
|
|
|
|
load_unflushed_to_ms_allocatables(spa, maptype);
|
2016-12-17 01:11:29 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vm_idxp is an in-out parameter which (for indirect vdevs) is the
|
|
|
|
* index in vim_entries that has the first entry in this metaslab.
|
|
|
|
* On return, it will be set to the first entry after this metaslab.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
load_indirect_ms_allocatable_tree(vdev_t *vd, metaslab_t *msp,
|
|
|
|
uint64_t *vim_idxp)
|
|
|
|
{
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
|
|
|
|
mutex_enter(&msp->ms_lock);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
range_tree_vacate(msp->ms_allocatable, NULL, NULL);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to spend the CPU manipulating the
|
|
|
|
* size-ordered tree, so clear the range_tree ops.
|
|
|
|
*/
|
|
|
|
msp->ms_allocatable->rt_ops = NULL;
|
|
|
|
|
|
|
|
for (; *vim_idxp < vdev_indirect_mapping_num_entries(vim);
|
|
|
|
(*vim_idxp)++) {
|
|
|
|
vdev_indirect_mapping_entry_phys_t *vimep =
|
|
|
|
&vim->vim_entries[*vim_idxp];
|
|
|
|
uint64_t ent_offset = DVA_MAPPING_GET_SRC_OFFSET(vimep);
|
|
|
|
uint64_t ent_len = DVA_GET_ASIZE(&vimep->vimep_dst);
|
|
|
|
ASSERT3U(ent_offset, >=, msp->ms_start);
|
|
|
|
if (ent_offset >= msp->ms_start + msp->ms_size)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mappings do not cross metaslab boundaries,
|
|
|
|
* because we create them by walking the metaslabs.
|
|
|
|
*/
|
|
|
|
ASSERT3U(ent_offset + ent_len, <=,
|
|
|
|
msp->ms_start + msp->ms_size);
|
|
|
|
range_tree_add(msp->ms_allocatable, ent_offset, ent_len);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!msp->ms_loaded)
|
|
|
|
msp->ms_loaded = B_TRUE;
|
|
|
|
mutex_exit(&msp->ms_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_leak_init_prepare_indirect_vdevs(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
2019-01-30 20:54:27 +03:00
|
|
|
ASSERT(!dump_opt['L']);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
for (uint64_t c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
|
|
|
|
ASSERT3U(c, ==, vd->vdev_id);
|
|
|
|
|
|
|
|
if (vd->vdev_ops != &vdev_indirect_ops)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: we don't check for mapping leaks on
|
|
|
|
* removing vdevs because their ms_allocatable's
|
|
|
|
* are used to look for leaks in allocated space.
|
|
|
|
*/
|
|
|
|
zcb->zcb_vd_obsolete_counts[c] = zdb_load_obsolete_counts(vd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Normally, indirect vdevs don't have any
|
|
|
|
* metaslabs. We want to set them up for
|
|
|
|
* zio_claim().
|
|
|
|
*/
|
|
|
|
VERIFY0(vdev_metaslab_init(vd, 0));
|
|
|
|
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
uint64_t vim_idx = 0;
|
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
|
|
|
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\rloading indirect vdev %llu, "
|
|
|
|
"metaslab %llu of %llu ...",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)vd->vdev_ms[m]->ms_id,
|
|
|
|
(longlong_t)vd->vdev_ms_count);
|
|
|
|
|
|
|
|
load_indirect_ms_allocatable_tree(vd, vd->vdev_ms[m],
|
|
|
|
&vim_idx);
|
|
|
|
}
|
|
|
|
ASSERT3U(vim_idx, ==, vdev_indirect_mapping_num_entries(vim));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
zdb_leak_init(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
|
|
|
zcb->zcb_spa = spa;
|
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
if (dump_opt['L'])
|
|
|
|
return;
|
2017-01-12 22:52:56 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
2017-01-12 22:52:56 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
/*
|
|
|
|
* We are going to be changing the meaning of the metaslab's
|
|
|
|
* ms_allocatable. Ensure that the allocator doesn't try to
|
|
|
|
* use the tree.
|
|
|
|
*/
|
|
|
|
spa->spa_normal_class->mc_ops = &zdb_metaslab_ops;
|
|
|
|
spa->spa_log_class->mc_ops = &zdb_metaslab_ops;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
zcb->zcb_vd_obsolete_counts =
|
|
|
|
umem_zalloc(rvd->vdev_children * sizeof (uint32_t *),
|
|
|
|
UMEM_NOFAIL);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
/*
|
|
|
|
* For leak detection, we overload the ms_allocatable trees
|
|
|
|
* to contain allocated segments instead of free segments.
|
|
|
|
* As a result, we can't use the normal metaslab_load/unload
|
|
|
|
* interfaces.
|
|
|
|
*/
|
|
|
|
zdb_leak_init_prepare_indirect_vdevs(spa, zcb);
|
|
|
|
load_concrete_ms_allocatable_trees(spa, SM_ALLOC);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
/*
|
|
|
|
* On load_concrete_ms_allocatable_trees() we loaded all the
|
|
|
|
* allocated entries from the ms_sm to the ms_allocatable for
|
|
|
|
* each metaslab. If the pool has a checkpoint or is in the
|
|
|
|
* middle of discarding a checkpoint, some of these blocks
|
|
|
|
* may have been freed but their ms_sm may not have been
|
|
|
|
* updated because they are referenced by the checkpoint. In
|
|
|
|
* order to avoid false-positives during leak-detection, we
|
|
|
|
* go through the vdev's checkpoint space map and exclude all
|
|
|
|
* its entries from their relevant ms_allocatable.
|
|
|
|
*
|
|
|
|
* We also aggregate the space held by the checkpoint and add
|
|
|
|
* it to zcb_checkpoint_size.
|
|
|
|
*
|
|
|
|
* Note that at this point we are also verifying that all the
|
|
|
|
* entries on the checkpoint_sm are marked as allocated in
|
|
|
|
* the ms_sm of their relevant metaslab.
|
|
|
|
* [see comment in checkpoint_sm_exclude_entry_cb()]
|
|
|
|
*/
|
|
|
|
zdb_leak_init_exclude_checkpoint(spa, zcb);
|
|
|
|
ASSERT3U(zcb->zcb_checkpoint_size, ==, spa_get_checkpoint_space(spa));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
/* for cleaner progress output */
|
|
|
|
(void) fprintf(stderr, "\n");
|
|
|
|
|
|
|
|
if (bpobj_is_open(&dp->dp_obsolete_bpobj)) {
|
|
|
|
ASSERT(spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_DEVICE_REMOVAL));
|
|
|
|
(void) bpobj_iterate_nofree(&dp->dp_obsolete_bpobj,
|
|
|
|
increment_indirect_mapping_cb, zcb, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
zdb_ddt_leak_init(spa, zcb);
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static boolean_t
|
|
|
|
zdb_check_for_obsolete_leaks(vdev_t *vd, zdb_cb_t *zcb)
|
|
|
|
{
|
|
|
|
boolean_t leaks = B_FALSE;
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
uint64_t total_leaked = 0;
|
2018-10-10 01:42:42 +03:00
|
|
|
boolean_t are_precise = B_FALSE;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
ASSERT(vim != NULL);
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < vdev_indirect_mapping_num_entries(vim); i++) {
|
|
|
|
vdev_indirect_mapping_entry_phys_t *vimep =
|
|
|
|
&vim->vim_entries[i];
|
|
|
|
uint64_t obsolete_bytes = 0;
|
|
|
|
uint64_t offset = DVA_MAPPING_GET_SRC_OFFSET(vimep);
|
|
|
|
metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is not very efficient but it's easy to
|
|
|
|
* verify correctness.
|
|
|
|
*/
|
|
|
|
for (uint64_t inner_offset = 0;
|
|
|
|
inner_offset < DVA_GET_ASIZE(&vimep->vimep_dst);
|
|
|
|
inner_offset += 1 << vd->vdev_ashift) {
|
2016-12-17 01:11:29 +03:00
|
|
|
if (range_tree_contains(msp->ms_allocatable,
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
offset + inner_offset, 1 << vd->vdev_ashift)) {
|
|
|
|
obsolete_bytes += 1 << vd->vdev_ashift;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int64_t bytes_leaked = obsolete_bytes -
|
|
|
|
zcb->zcb_vd_obsolete_counts[vd->vdev_id][i];
|
|
|
|
ASSERT3U(DVA_GET_ASIZE(&vimep->vimep_dst), >=,
|
|
|
|
zcb->zcb_vd_obsolete_counts[vd->vdev_id][i]);
|
2018-10-10 01:42:42 +03:00
|
|
|
|
|
|
|
VERIFY0(vdev_obsolete_counts_are_precise(vd, &are_precise));
|
|
|
|
if (bytes_leaked != 0 && (are_precise || dump_opt['d'] >= 5)) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
(void) printf("obsolete indirect mapping count "
|
|
|
|
"mismatch on %llu:%llx:%llx : %llx bytes leaked\n",
|
|
|
|
(u_longlong_t)vd->vdev_id,
|
|
|
|
(u_longlong_t)DVA_MAPPING_GET_SRC_OFFSET(vimep),
|
|
|
|
(u_longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
|
|
|
|
(u_longlong_t)bytes_leaked);
|
|
|
|
}
|
|
|
|
total_leaked += ABS(bytes_leaked);
|
|
|
|
}
|
|
|
|
|
2018-10-10 01:42:42 +03:00
|
|
|
VERIFY0(vdev_obsolete_counts_are_precise(vd, &are_precise));
|
|
|
|
if (!are_precise && total_leaked > 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
int pct_leaked = total_leaked * 100 /
|
|
|
|
vdev_indirect_mapping_bytes_mapped(vim);
|
|
|
|
(void) printf("cannot verify obsolete indirect mapping "
|
|
|
|
"counts of vdev %llu because precise feature was not "
|
|
|
|
"enabled when it was removed: %d%% (%llx bytes) of mapping"
|
|
|
|
"unreferenced\n",
|
|
|
|
(u_longlong_t)vd->vdev_id, pct_leaked,
|
|
|
|
(u_longlong_t)total_leaked);
|
|
|
|
} else if (total_leaked > 0) {
|
|
|
|
(void) printf("obsolete indirect mapping count mismatch "
|
|
|
|
"for vdev %llu -- %llx total bytes mismatched\n",
|
|
|
|
(u_longlong_t)vd->vdev_id,
|
|
|
|
(u_longlong_t)total_leaked);
|
|
|
|
leaks |= B_TRUE;
|
|
|
|
}
|
|
|
|
|
|
|
|
vdev_indirect_mapping_free_obsolete_counts(vim,
|
|
|
|
zcb->zcb_vd_obsolete_counts[vd->vdev_id]);
|
|
|
|
zcb->zcb_vd_obsolete_counts[vd->vdev_id] = NULL;
|
|
|
|
|
|
|
|
return (leaks);
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
zdb_leak_fini(spa_t *spa, zdb_cb_t *zcb)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2019-01-30 20:54:27 +03:00
|
|
|
if (dump_opt['L'])
|
|
|
|
return (B_FALSE);
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
boolean_t leaks = B_FALSE;
|
2019-01-30 20:54:27 +03:00
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
for (unsigned c = 0; c < rvd->vdev_children; c++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
|
|
|
ASSERTV(metaslab_group_t *mg = vd->vdev_mg);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
if (zcb->zcb_vd_obsolete_counts[c] != NULL) {
|
|
|
|
leaks |= zdb_check_for_obsolete_leaks(vd, zcb);
|
|
|
|
}
|
2017-01-12 22:52:56 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
|
|
|
metaslab_t *msp = vd->vdev_ms[m];
|
|
|
|
ASSERT3P(mg, ==, msp->ms_group);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ms_allocatable has been overloaded
|
|
|
|
* to contain allocated segments. Now that
|
|
|
|
* we finished traversing all blocks, any
|
|
|
|
* block that remains in the ms_allocatable
|
|
|
|
* represents an allocated block that we
|
|
|
|
* did not claim during the traversal.
|
|
|
|
* Claimed blocks would have been removed
|
|
|
|
* from the ms_allocatable. For indirect
|
|
|
|
* vdevs, space remaining in the tree
|
|
|
|
* represents parts of the mapping that are
|
|
|
|
* not referenced, which is not a bug.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
range_tree_vacate(msp->ms_allocatable,
|
|
|
|
NULL, NULL);
|
|
|
|
} else {
|
|
|
|
range_tree_vacate(msp->ms_allocatable,
|
|
|
|
zdb_leak, vd);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2019-01-30 20:54:27 +03:00
|
|
|
if (msp->ms_loaded) {
|
|
|
|
msp->ms_loaded = B_FALSE;
|
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2019-01-30 20:54:27 +03:00
|
|
|
|
|
|
|
umem_free(zcb->zcb_vd_obsolete_counts,
|
|
|
|
rvd->vdev_children * sizeof (uint32_t *));
|
|
|
|
zcb->zcb_vd_obsolete_counts = NULL;
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
return (leaks);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
count_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
zdb_cb_t *zcb = arg;
|
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (dump_opt['b'] >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("[%s] %s\n",
|
|
|
|
"deferred free", blkbuf);
|
|
|
|
}
|
|
|
|
zdb_count_block(zcb, NULL, bp, ZDB_OT_DEFERRED);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
/*
|
|
|
|
* Iterate over livelists which have been destroyed by the user but
|
|
|
|
* are still present in the MOS, waiting to be freed
|
|
|
|
*/
|
|
|
|
typedef void ll_iter_t(dsl_deadlist_t *ll, void *arg);
|
|
|
|
|
|
|
|
static void
|
|
|
|
iterate_deleted_livelists(spa_t *spa, ll_iter_t func, void *arg)
|
|
|
|
{
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
uint64_t zap_obj;
|
|
|
|
int err = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_DELETED_CLONES, sizeof (uint64_t), 1, &zap_obj);
|
|
|
|
if (err == ENOENT)
|
|
|
|
return;
|
|
|
|
ASSERT0(err);
|
|
|
|
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
dsl_deadlist_t ll;
|
|
|
|
/* NULL out os prior to dsl_deadlist_open in case it's garbage */
|
|
|
|
ll.dl_os = NULL;
|
|
|
|
for (zap_cursor_init(&zc, mos, zap_obj);
|
|
|
|
zap_cursor_retrieve(&zc, &attr) == 0;
|
|
|
|
(void) zap_cursor_advance(&zc)) {
|
|
|
|
dsl_deadlist_open(&ll, mos, attr.za_first_integer);
|
|
|
|
func(&ll, arg);
|
|
|
|
dsl_deadlist_close(&ll);
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
bpobj_count_block_cb(void *arg, const blkptr_t *bp, boolean_t bp_freed,
|
|
|
|
dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
ASSERT(!bp_freed);
|
|
|
|
return (count_block_cb(arg, bp, tx));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
livelist_entry_count_blocks_cb(void *args, dsl_deadlist_entry_t *dle)
|
|
|
|
{
|
|
|
|
zdb_cb_t *zbc = args;
|
|
|
|
bplist_t blks;
|
|
|
|
bplist_create(&blks);
|
|
|
|
/* determine which blocks have been alloc'd but not freed */
|
|
|
|
VERIFY0(dsl_process_sub_livelist(&dle->dle_bpobj, &blks, NULL, NULL));
|
|
|
|
/* count those blocks */
|
|
|
|
(void) bplist_iterate(&blks, count_block_cb, zbc, NULL);
|
|
|
|
bplist_destroy(&blks);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
livelist_count_blocks(dsl_deadlist_t *ll, void *arg)
|
|
|
|
{
|
|
|
|
dsl_deadlist_iterate(ll, livelist_entry_count_blocks_cb, arg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Count the blocks in the livelists that have been destroyed by the user
|
|
|
|
* but haven't yet been freed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
deleted_livelists_count_blocks(spa_t *spa, zdb_cb_t *zbc)
|
|
|
|
{
|
|
|
|
iterate_deleted_livelists(spa, livelist_count_blocks, zbc);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_livelist_cb(dsl_deadlist_t *ll, void *arg)
|
|
|
|
{
|
|
|
|
ASSERT3P(arg, ==, NULL);
|
|
|
|
global_feature_count[SPA_FEATURE_LIVELIST]++;
|
|
|
|
dump_blkptr_list(ll, "Deleted Livelist");
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Print out, register object references to, and increment feature counts for
|
|
|
|
* livelists that have been destroyed by the user but haven't yet been freed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
deleted_livelists_dump_mos(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint64_t zap_obj;
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
int err = zap_lookup(mos, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_DELETED_CLONES, sizeof (uint64_t), 1, &zap_obj);
|
|
|
|
if (err == ENOENT)
|
|
|
|
return;
|
|
|
|
mos_obj_refd(zap_obj);
|
|
|
|
iterate_deleted_livelists(spa, dump_livelist_cb, NULL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
|
|
|
dump_block_stats(spa_t *spa)
|
|
|
|
{
|
2010-08-26 20:52:41 +04:00
|
|
|
zdb_cb_t zcb;
|
2008-11-20 23:01:55 +03:00
|
|
|
zdb_blkstats_t *zb, *tzb;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t norm_alloc, norm_space, total_alloc, total_found;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int flags = TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA |
|
|
|
|
TRAVERSE_NO_DECRYPT | TRAVERSE_HARD;
|
2014-06-06 01:19:08 +04:00
|
|
|
boolean_t leaks = B_FALSE;
|
2018-02-02 02:42:41 +03:00
|
|
|
int e, c, err;
|
2014-06-06 01:19:08 +04:00
|
|
|
bp_embedded_type_t i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&zcb, sizeof (zcb));
|
2013-03-25 01:24:51 +04:00
|
|
|
(void) printf("\nTraversing all blocks %s%s%s%s%s...\n\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
(dump_opt['c'] || !dump_opt['L']) ? "to verify " : "",
|
|
|
|
(dump_opt['c'] == 1) ? "metadata " : "",
|
|
|
|
dump_opt['c'] ? "checksums " : "",
|
|
|
|
(dump_opt['c'] && !dump_opt['L']) ? "and verify " : "",
|
|
|
|
!dump_opt['L'] ? "nothing leaked " : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2019-01-30 20:54:27 +03:00
|
|
|
* When leak detection is enabled we load all space maps as SM_ALLOC
|
|
|
|
* maps, then traverse the pool claiming each block we discover. If
|
|
|
|
* the pool is perfectly consistent, the segment trees will be empty
|
|
|
|
* when we're done. Anything left over is a leak; any block we can't
|
|
|
|
* claim (because it's not part of any space map) is a double
|
|
|
|
* allocation, reference to a freed block, or an unclaimed log block.
|
|
|
|
*
|
|
|
|
* When leak detection is disabled (-L option) we still traverse the
|
|
|
|
* pool claiming each block we discover, but we skip opening any space
|
|
|
|
* maps.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2013-11-01 23:26:11 +04:00
|
|
|
bzero(&zcb, sizeof (zdb_cb_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_leak_init(spa, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If there's a deferred-free bplist, process that first.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) bpobj_iterate_nofree(&spa->spa_deferred_bpobj,
|
2019-07-26 20:54:14 +03:00
|
|
|
bpobj_count_block_cb, &zcb, NULL);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
|
|
|
|
(void) bpobj_iterate_nofree(&spa->spa_dsl_pool->dp_free_bpobj,
|
2019-07-26 20:54:14 +03:00
|
|
|
bpobj_count_block_cb, &zcb, NULL);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
zdb_claim_removing(spa, &zcb);
|
|
|
|
|
2013-10-08 21:13:05 +04:00
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_ASYNC_DESTROY)) {
|
2012-12-14 03:24:15 +04:00
|
|
|
VERIFY3U(0, ==, bptree_iterate(spa->spa_meta_objset,
|
|
|
|
spa->spa_dsl_pool->dp_bptree_obj, B_FALSE, count_block_cb,
|
|
|
|
&zcb, NULL));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
deleted_livelists_count_blocks(spa, &zcb);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['c'] > 1)
|
|
|
|
flags |= TRAVERSE_PREFETCH_DATA;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
zcb.zcb_totalasize = metaslab_class_get_alloc(spa_normal_class(spa));
|
2018-09-06 04:33:36 +03:00
|
|
|
zcb.zcb_totalasize += metaslab_class_get_alloc(spa_special_class(spa));
|
|
|
|
zcb.zcb_totalasize += metaslab_class_get_alloc(spa_dedup_class(spa));
|
2013-03-25 01:24:51 +04:00
|
|
|
zcb.zcb_start = zcb.zcb_lastprint = gethrtime();
|
2018-02-02 02:42:41 +03:00
|
|
|
err = traverse_pool(spa, 0, flags, zdb_blkptr_cb, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
/*
|
|
|
|
* If we've traversed the data blocks then we need to wait for those
|
|
|
|
* I/Os to complete. We leverage "The Godfather" zio to wait on
|
|
|
|
* all async I/Os to complete.
|
|
|
|
*/
|
|
|
|
if (dump_opt['c']) {
|
2014-09-17 10:59:43 +04:00
|
|
|
for (c = 0; c < max_ncpus; c++) {
|
|
|
|
(void) zio_wait(spa->spa_async_zio_root[c]);
|
|
|
|
spa->spa_async_zio_root[c] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
|
|
|
|
ZIO_FLAG_GODFATHER);
|
|
|
|
}
|
2013-05-03 03:36:32 +04:00
|
|
|
}
|
2019-08-13 17:11:57 +03:00
|
|
|
ASSERT0(spa->spa_load_verify_bytes);
|
2013-05-03 03:36:32 +04:00
|
|
|
|
2018-02-02 02:42:41 +03:00
|
|
|
/*
|
|
|
|
* Done after zio_wait() since zcb_haderrors is modified in
|
|
|
|
* zdb_blkptr_done()
|
|
|
|
*/
|
|
|
|
zcb.zcb_haderrors |= err;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zcb.zcb_haderrors) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\nError counts:\n\n");
|
|
|
|
(void) printf("\t%5s %s\n", "errno", "count");
|
2010-08-26 20:52:39 +04:00
|
|
|
for (e = 0; e < 256; e++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zcb.zcb_errors[e] != 0) {
|
|
|
|
(void) printf("\t%5d %llu\n",
|
|
|
|
e, (u_longlong_t)zcb.zcb_errors[e]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Report any leaked segments.
|
|
|
|
*/
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
leaks |= zdb_leak_fini(spa, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
tzb = &zcb.zcb_type[ZB_TOTAL][ZDB_OT_TOTAL];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
norm_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
norm_space = metaslab_class_get_space(spa_normal_class(spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
total_alloc = norm_alloc +
|
|
|
|
metaslab_class_get_alloc(spa_log_class(spa)) +
|
|
|
|
metaslab_class_get_alloc(spa_special_class(spa)) +
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
metaslab_class_get_alloc(spa_dedup_class(spa)) +
|
|
|
|
get_unflushed_alloc_space(spa);
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
total_found = tzb->zb_asize - zcb.zcb_dedup_asize +
|
2016-12-17 01:11:29 +03:00
|
|
|
zcb.zcb_removing_size + zcb.zcb_checkpoint_size;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-01-30 20:54:27 +03:00
|
|
|
if (total_found == total_alloc && !dump_opt['L']) {
|
|
|
|
(void) printf("\n\tNo leaks (block sum matches space"
|
|
|
|
" maps exactly)\n");
|
|
|
|
} else if (!dump_opt['L']) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("block traversal size %llu != alloc %llu "
|
2009-01-16 00:59:39 +03:00
|
|
|
"(%s %lld)\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)total_found,
|
|
|
|
(u_longlong_t)total_alloc,
|
2009-01-16 00:59:39 +03:00
|
|
|
(dump_opt['L']) ? "unreachable" : "leaked",
|
2010-05-29 00:45:14 +04:00
|
|
|
(longlong_t)(total_alloc - total_found));
|
2014-06-06 01:19:08 +04:00
|
|
|
leaks = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (tzb->zb_count == 0)
|
|
|
|
return (2);
|
|
|
|
|
|
|
|
(void) printf("\n");
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu\n", "bp count:",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)tzb->zb_count);
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu\n", "ganged count:",
|
2014-11-03 22:12:40 +03:00
|
|
|
(longlong_t)tzb->zb_gangs);
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu avg: %6llu\n", "bp logical:",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)tzb->zb_lsize,
|
|
|
|
(u_longlong_t)(tzb->zb_lsize / tzb->zb_count));
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu avg: %6llu compression: %6.2f\n",
|
|
|
|
"bp physical:", (u_longlong_t)tzb->zb_psize,
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)(tzb->zb_psize / tzb->zb_count),
|
|
|
|
(double)tzb->zb_lsize / tzb->zb_psize);
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu avg: %6llu compression: %6.2f\n",
|
|
|
|
"bp allocated:", (u_longlong_t)tzb->zb_asize,
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)(tzb->zb_asize / tzb->zb_count),
|
|
|
|
(double)tzb->zb_lsize / tzb->zb_asize);
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu ref>1: %6llu deduplication: %6.2f\n",
|
|
|
|
"bp deduped:", (u_longlong_t)zcb.zcb_dedup_asize,
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)zcb.zcb_dedup_blocks,
|
|
|
|
(double)zcb.zcb_dedup_asize / tzb->zb_asize + 1.0);
|
2018-09-06 04:33:36 +03:00
|
|
|
(void) printf("\t%-16s %14llu used: %5.2f%%\n", "Normal class:",
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)norm_alloc, 100.0 * norm_alloc / norm_space);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
if (spa_special_class(spa)->mc_rotor != NULL) {
|
|
|
|
uint64_t alloc = metaslab_class_get_alloc(
|
|
|
|
spa_special_class(spa));
|
|
|
|
uint64_t space = metaslab_class_get_space(
|
|
|
|
spa_special_class(spa));
|
|
|
|
|
|
|
|
(void) printf("\t%-16s %14llu used: %5.2f%%\n",
|
|
|
|
"Special class", (u_longlong_t)alloc,
|
|
|
|
100.0 * alloc / space);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (spa_dedup_class(spa)->mc_rotor != NULL) {
|
|
|
|
uint64_t alloc = metaslab_class_get_alloc(
|
|
|
|
spa_dedup_class(spa));
|
|
|
|
uint64_t space = metaslab_class_get_space(
|
|
|
|
spa_dedup_class(spa));
|
|
|
|
|
|
|
|
(void) printf("\t%-16s %14llu used: %5.2f%%\n",
|
|
|
|
"Dedup class", (u_longlong_t)alloc,
|
|
|
|
100.0 * alloc / space);
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
for (i = 0; i < NUM_BP_EMBEDDED_TYPES; i++) {
|
|
|
|
if (zcb.zcb_embedded_blocks[i] == 0)
|
|
|
|
continue;
|
|
|
|
(void) printf("\n");
|
|
|
|
(void) printf("\tadditional, non-pointer bps of type %u: "
|
|
|
|
"%10llu\n",
|
|
|
|
i, (u_longlong_t)zcb.zcb_embedded_blocks[i]);
|
|
|
|
|
|
|
|
if (dump_opt['b'] >= 3) {
|
|
|
|
(void) printf("\t number of (compressed) bytes: "
|
|
|
|
"number of bps\n");
|
|
|
|
dump_histogram(zcb.zcb_embedded_histogram[i],
|
|
|
|
sizeof (zcb.zcb_embedded_histogram[i]) /
|
|
|
|
sizeof (zcb.zcb_embedded_histogram[i][0]), 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-11-03 22:12:40 +03:00
|
|
|
if (tzb->zb_ditto_samevdev != 0) {
|
|
|
|
(void) printf("\tDittoed blocks on same vdev: %llu\n",
|
|
|
|
(longlong_t)tzb->zb_ditto_samevdev);
|
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
if (tzb->zb_ditto_same_ms != 0) {
|
|
|
|
(void) printf("\tDittoed blocks in same metaslab: %llu\n",
|
|
|
|
(longlong_t)tzb->zb_ditto_same_ms);
|
|
|
|
}
|
2014-11-03 22:12:40 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
for (uint64_t v = 0; v < spa->spa_root_vdev->vdev_children; v++) {
|
|
|
|
vdev_t *vd = spa->spa_root_vdev->vdev_child[v];
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
|
|
|
|
if (vim == NULL) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
char mem[32];
|
|
|
|
zdb_nicenum(vdev_indirect_mapping_num_entries(vim),
|
|
|
|
mem, vdev_indirect_mapping_size(vim));
|
|
|
|
|
|
|
|
(void) printf("\tindirect vdev id %llu has %llu segments "
|
|
|
|
"(%s in memory)\n",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)vdev_indirect_mapping_num_entries(vim), mem);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['b'] >= 2) {
|
|
|
|
int l, t, level;
|
|
|
|
(void) printf("\nBlocks\tLSIZE\tPSIZE\tASIZE"
|
|
|
|
"\t avg\t comp\t%%Total\tType\n");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
for (t = 0; t <= ZDB_OT_TOTAL; t++) {
|
|
|
|
char csize[32], lsize[32], psize[32], asize[32];
|
2014-11-03 22:12:40 +03:00
|
|
|
char avg[32], gang[32];
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *typename;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (csize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (lsize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (psize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (asize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (avg) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (gang) >= NN_NUMBUF_SZ);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (t < DMU_OT_NUMTYPES)
|
|
|
|
typename = dmu_ot[t].ot_name;
|
|
|
|
else
|
|
|
|
typename = zdb_ot_extname[t - DMU_OT_NUMTYPES];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (zcb.zcb_type[ZB_TOTAL][t].zb_asize == 0) {
|
|
|
|
(void) printf("%6s\t%5s\t%5s\t%5s"
|
|
|
|
"\t%5s\t%5s\t%6s\t%s\n",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
"-",
|
|
|
|
typename);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (l = ZB_TOTAL - 1; l >= -1; l--) {
|
|
|
|
level = (l == -1 ? ZB_TOTAL : l);
|
|
|
|
zb = &zcb.zcb_type[level][t];
|
|
|
|
|
|
|
|
if (zb->zb_asize == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (dump_opt['b'] < 3 && level != ZB_TOTAL)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (level == 0 && zb->zb_asize ==
|
|
|
|
zcb.zcb_type[ZB_TOTAL][t].zb_asize)
|
|
|
|
continue;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(zb->zb_count, csize,
|
|
|
|
sizeof (csize));
|
|
|
|
zdb_nicenum(zb->zb_lsize, lsize,
|
|
|
|
sizeof (lsize));
|
|
|
|
zdb_nicenum(zb->zb_psize, psize,
|
|
|
|
sizeof (psize));
|
|
|
|
zdb_nicenum(zb->zb_asize, asize,
|
|
|
|
sizeof (asize));
|
|
|
|
zdb_nicenum(zb->zb_asize / zb->zb_count, avg,
|
|
|
|
sizeof (avg));
|
|
|
|
zdb_nicenum(zb->zb_gangs, gang, sizeof (gang));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("%6s\t%5s\t%5s\t%5s\t%5s"
|
|
|
|
"\t%5.2f\t%6.2f\t",
|
|
|
|
csize, lsize, psize, asize, avg,
|
|
|
|
(double)zb->zb_lsize / zb->zb_psize,
|
|
|
|
100.0 * zb->zb_asize / tzb->zb_asize);
|
|
|
|
|
|
|
|
if (level == ZB_TOTAL)
|
|
|
|
(void) printf("%s\n", typename);
|
|
|
|
else
|
|
|
|
(void) printf(" L%d %s\n",
|
|
|
|
level, typename);
|
2013-03-25 01:24:51 +04:00
|
|
|
|
2014-11-03 22:12:40 +03:00
|
|
|
if (dump_opt['b'] >= 3 && zb->zb_gangs > 0) {
|
|
|
|
(void) printf("\t number of ganged "
|
|
|
|
"blocks: %s\n", gang);
|
|
|
|
}
|
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (dump_opt['b'] >= 4) {
|
|
|
|
(void) printf("psize "
|
|
|
|
"(in 512-byte sectors): "
|
|
|
|
"number of blocks\n");
|
|
|
|
dump_histogram(zb->zb_psize_histogram,
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
PSIZE_HISTO_SIZE, 0);
|
2013-03-25 01:24:51 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
if (leaks)
|
|
|
|
return (2);
|
|
|
|
|
|
|
|
if (zcb.zcb_haderrors)
|
|
|
|
return (3);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
typedef struct zdb_ddt_entry {
|
|
|
|
ddt_key_t zdde_key;
|
|
|
|
uint64_t zdde_ref_blocks;
|
|
|
|
uint64_t zdde_ref_lsize;
|
|
|
|
uint64_t zdde_ref_psize;
|
|
|
|
uint64_t zdde_ref_dsize;
|
|
|
|
avl_node_t zdde_node;
|
|
|
|
} zdb_ddt_entry_t;
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
zdb_ddt_add_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
|
2014-06-25 22:37:59 +04:00
|
|
|
const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
|
|
|
avl_tree_t *t = arg;
|
|
|
|
avl_index_t where;
|
|
|
|
zdb_ddt_entry_t *zdde, zdde_search;
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (zb->zb_level == ZB_DNODE_LEVEL || BP_IS_HOLE(bp) ||
|
|
|
|
BP_IS_EMBEDDED(bp))
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (dump_opt['S'] > 1 && zb->zb_level == ZB_ROOT_LEVEL) {
|
|
|
|
(void) printf("traversing objset %llu, %llu objects, "
|
|
|
|
"%lu blocks so far\n",
|
|
|
|
(u_longlong_t)zb->zb_objset,
|
2014-06-06 01:19:08 +04:00
|
|
|
(u_longlong_t)BP_GET_FILL(bp),
|
2010-05-29 00:45:14 +04:00
|
|
|
avl_numnodes(t));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (BP_IS_HOLE(bp) || BP_GET_CHECKSUM(bp) == ZIO_CHECKSUM_OFF ||
|
2012-12-14 03:24:15 +04:00
|
|
|
BP_GET_LEVEL(bp) > 0 || DMU_OT_IS_METADATA(BP_GET_TYPE(bp)))
|
2010-05-29 00:45:14 +04:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
ddt_key_fill(&zdde_search.zdde_key, bp);
|
|
|
|
|
|
|
|
zdde = avl_find(t, &zdde_search, &where);
|
|
|
|
|
|
|
|
if (zdde == NULL) {
|
|
|
|
zdde = umem_zalloc(sizeof (*zdde), UMEM_NOFAIL);
|
|
|
|
zdde->zdde_key = zdde_search.zdde_key;
|
|
|
|
avl_insert(t, zdde, where);
|
|
|
|
}
|
|
|
|
|
|
|
|
zdde->zdde_ref_blocks += 1;
|
|
|
|
zdde->zdde_ref_lsize += BP_GET_LSIZE(bp);
|
|
|
|
zdde->zdde_ref_psize += BP_GET_PSIZE(bp);
|
|
|
|
zdde->zdde_ref_dsize += bp_get_dsize_sync(spa, bp);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_simulated_ddt(spa_t *spa)
|
|
|
|
{
|
|
|
|
avl_tree_t t;
|
|
|
|
void *cookie = NULL;
|
|
|
|
zdb_ddt_entry_t *zdde;
|
2010-08-26 20:52:41 +04:00
|
|
|
ddt_histogram_t ddh_total;
|
|
|
|
ddt_stat_t dds_total;
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&ddh_total, sizeof (ddh_total));
|
|
|
|
bzero(&dds_total, sizeof (dds_total));
|
2010-05-29 00:45:14 +04:00
|
|
|
avl_create(&t, ddt_entry_compare,
|
|
|
|
sizeof (zdb_ddt_entry_t), offsetof(zdb_ddt_entry_t, zdde_node));
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
(void) traverse_pool(spa, 0, TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA |
|
|
|
|
TRAVERSE_NO_DECRYPT, zdb_ddt_add_cb, &t);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
|
|
|
|
while ((zdde = avl_destroy_nodes(&t, &cookie)) != NULL) {
|
|
|
|
ddt_stat_t dds;
|
|
|
|
uint64_t refcnt = zdde->zdde_ref_blocks;
|
|
|
|
ASSERT(refcnt != 0);
|
|
|
|
|
|
|
|
dds.dds_blocks = zdde->zdde_ref_blocks / refcnt;
|
|
|
|
dds.dds_lsize = zdde->zdde_ref_lsize / refcnt;
|
|
|
|
dds.dds_psize = zdde->zdde_ref_psize / refcnt;
|
|
|
|
dds.dds_dsize = zdde->zdde_ref_dsize / refcnt;
|
|
|
|
|
|
|
|
dds.dds_ref_blocks = zdde->zdde_ref_blocks;
|
|
|
|
dds.dds_ref_lsize = zdde->zdde_ref_lsize;
|
|
|
|
dds.dds_ref_psize = zdde->zdde_ref_psize;
|
|
|
|
dds.dds_ref_dsize = zdde->zdde_ref_dsize;
|
|
|
|
|
2014-04-16 07:40:22 +04:00
|
|
|
ddt_stat_add(&ddh_total.ddh_stat[highbit64(refcnt) - 1],
|
|
|
|
&dds, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
umem_free(zdde, sizeof (*zdde));
|
|
|
|
}
|
|
|
|
|
|
|
|
avl_destroy(&t);
|
|
|
|
|
|
|
|
ddt_histogram_stat(&dds_total, &ddh_total);
|
|
|
|
|
|
|
|
(void) printf("Simulated DDT histogram:\n");
|
|
|
|
|
|
|
|
zpool_dump_ddt(&dds_total, &ddh_total);
|
|
|
|
|
|
|
|
dump_dedup_ratio(&dds_total);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static int
|
|
|
|
verify_device_removal_feature_counts(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint64_t dr_feature_refcount = 0;
|
|
|
|
uint64_t oc_feature_refcount = 0;
|
|
|
|
uint64_t indirect_vdev_count = 0;
|
|
|
|
uint64_t precise_vdev_count = 0;
|
|
|
|
uint64_t obsolete_counts_object_count = 0;
|
|
|
|
uint64_t obsolete_sm_count = 0;
|
|
|
|
uint64_t obsolete_counts_count = 0;
|
|
|
|
uint64_t scip_count = 0;
|
|
|
|
uint64_t obsolete_bpobj_count = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
spa_condensing_indirect_phys_t *scip =
|
|
|
|
&spa->spa_condensing_indirect_phys;
|
|
|
|
if (scip->scip_next_mapping_object != 0) {
|
|
|
|
vdev_t *vd = spa->spa_root_vdev->vdev_child[scip->scip_vdev];
|
|
|
|
ASSERT(scip->scip_prev_obsolete_sm_object != 0);
|
|
|
|
ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);
|
|
|
|
|
|
|
|
(void) printf("Condensing indirect vdev %llu: new mapping "
|
|
|
|
"object %llu, prev obsolete sm %llu\n",
|
|
|
|
(u_longlong_t)scip->scip_vdev,
|
|
|
|
(u_longlong_t)scip->scip_next_mapping_object,
|
|
|
|
(u_longlong_t)scip->scip_prev_obsolete_sm_object);
|
|
|
|
if (scip->scip_prev_obsolete_sm_object != 0) {
|
|
|
|
space_map_t *prev_obsolete_sm = NULL;
|
|
|
|
VERIFY0(space_map_open(&prev_obsolete_sm,
|
|
|
|
spa->spa_meta_objset,
|
|
|
|
scip->scip_prev_obsolete_sm_object,
|
|
|
|
0, vd->vdev_asize, 0));
|
|
|
|
dump_spacemap(spa->spa_meta_objset, prev_obsolete_sm);
|
|
|
|
(void) printf("\n");
|
|
|
|
space_map_close(prev_obsolete_sm);
|
|
|
|
}
|
|
|
|
|
|
|
|
scip_count += 2;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
|
|
|
|
vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
|
|
|
|
vdev_indirect_config_t *vic = &vd->vdev_indirect_config;
|
|
|
|
|
|
|
|
if (vic->vic_mapping_object != 0) {
|
|
|
|
ASSERT(vd->vdev_ops == &vdev_indirect_ops ||
|
|
|
|
vd->vdev_removing);
|
|
|
|
indirect_vdev_count++;
|
|
|
|
|
|
|
|
if (vd->vdev_indirect_mapping->vim_havecounts) {
|
|
|
|
obsolete_counts_count++;
|
|
|
|
}
|
|
|
|
}
|
2018-10-10 01:42:42 +03:00
|
|
|
|
|
|
|
boolean_t are_precise;
|
|
|
|
VERIFY0(vdev_obsolete_counts_are_precise(vd, &are_precise));
|
|
|
|
if (are_precise) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ASSERT(vic->vic_mapping_object != 0);
|
|
|
|
precise_vdev_count++;
|
|
|
|
}
|
2018-10-10 01:42:42 +03:00
|
|
|
|
|
|
|
uint64_t obsolete_sm_object;
|
|
|
|
VERIFY0(vdev_obsolete_sm_object(vd, &obsolete_sm_object));
|
|
|
|
if (obsolete_sm_object != 0) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ASSERT(vic->vic_mapping_object != 0);
|
|
|
|
obsolete_sm_count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) feature_get_refcount(spa,
|
|
|
|
&spa_feature_table[SPA_FEATURE_DEVICE_REMOVAL],
|
|
|
|
&dr_feature_refcount);
|
|
|
|
(void) feature_get_refcount(spa,
|
|
|
|
&spa_feature_table[SPA_FEATURE_OBSOLETE_COUNTS],
|
|
|
|
&oc_feature_refcount);
|
|
|
|
|
|
|
|
if (dr_feature_refcount != indirect_vdev_count) {
|
|
|
|
ret = 1;
|
|
|
|
(void) printf("Number of indirect vdevs (%llu) " \
|
|
|
|
"does not match feature count (%llu)\n",
|
|
|
|
(u_longlong_t)indirect_vdev_count,
|
|
|
|
(u_longlong_t)dr_feature_refcount);
|
|
|
|
} else {
|
|
|
|
(void) printf("Verified device_removal feature refcount " \
|
|
|
|
"of %llu is correct\n",
|
|
|
|
(u_longlong_t)dr_feature_refcount);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zap_contains(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_OBSOLETE_BPOBJ) == 0) {
|
|
|
|
obsolete_bpobj_count++;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
obsolete_counts_object_count = precise_vdev_count;
|
|
|
|
obsolete_counts_object_count += obsolete_sm_count;
|
|
|
|
obsolete_counts_object_count += obsolete_counts_count;
|
|
|
|
obsolete_counts_object_count += scip_count;
|
|
|
|
obsolete_counts_object_count += obsolete_bpobj_count;
|
|
|
|
obsolete_counts_object_count += remap_deadlist_count;
|
|
|
|
|
|
|
|
if (oc_feature_refcount != obsolete_counts_object_count) {
|
|
|
|
ret = 1;
|
|
|
|
(void) printf("Number of obsolete counts objects (%llu) " \
|
|
|
|
"does not match feature count (%llu)\n",
|
|
|
|
(u_longlong_t)obsolete_counts_object_count,
|
|
|
|
(u_longlong_t)oc_feature_refcount);
|
|
|
|
(void) printf("pv:%llu os:%llu oc:%llu sc:%llu "
|
|
|
|
"ob:%llu rd:%llu\n",
|
|
|
|
(u_longlong_t)precise_vdev_count,
|
|
|
|
(u_longlong_t)obsolete_sm_count,
|
|
|
|
(u_longlong_t)obsolete_counts_count,
|
|
|
|
(u_longlong_t)scip_count,
|
|
|
|
(u_longlong_t)obsolete_bpobj_count,
|
|
|
|
(u_longlong_t)remap_deadlist_count);
|
|
|
|
} else {
|
|
|
|
(void) printf("Verified indirect_refcount feature refcount " \
|
|
|
|
"of %llu is correct\n",
|
|
|
|
(u_longlong_t)oc_feature_refcount);
|
|
|
|
}
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
2018-08-20 20:05:23 +03:00
|
|
|
static void
|
|
|
|
zdb_set_skip_mmp(char *target)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable the activity check to allow examination of
|
|
|
|
* active pools.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(target)) != NULL) {
|
|
|
|
spa->spa_import_flags |= ZFS_IMPORT_SKIP_MMP;
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
#define BOGUS_SUFFIX "_CHECKPOINTED_UNIVERSE"
|
|
|
|
/*
|
|
|
|
* Import the checkpointed state of the pool specified by the target
|
|
|
|
* parameter as readonly. The function also accepts a pool config
|
|
|
|
* as an optional parameter, else it attempts to infer the config by
|
|
|
|
* the name of the target pool.
|
|
|
|
*
|
|
|
|
* Note that the checkpointed state's pool name will be the name of
|
2019-08-30 19:43:30 +03:00
|
|
|
* the original pool with the above suffix appended to it. In addition,
|
2016-12-17 01:11:29 +03:00
|
|
|
* if the target is not a pool name (e.g. a path to a dataset) then
|
|
|
|
* the new_path parameter is populated with the updated path to
|
|
|
|
* reflect the fact that we are looking into the checkpointed state.
|
|
|
|
*
|
|
|
|
* The function returns a newly-allocated copy of the name of the
|
|
|
|
* pool containing the checkpointed state. When this copy is no
|
|
|
|
* longer needed it should be freed with free(3C). Same thing
|
|
|
|
* applies to the new_path parameter if allocated.
|
|
|
|
*/
|
|
|
|
static char *
|
|
|
|
import_checkpointed_state(char *target, nvlist_t *cfg, char **new_path)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
char *poolname, *bogus_name = NULL;
|
|
|
|
|
|
|
|
/* If the target is not a pool, the extract the pool name */
|
|
|
|
char *path_start = strchr(target, '/');
|
|
|
|
if (path_start != NULL) {
|
|
|
|
size_t poolname_len = path_start - target;
|
|
|
|
poolname = strndup(target, poolname_len);
|
|
|
|
} else {
|
|
|
|
poolname = target;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cfg == NULL) {
|
2018-08-20 20:05:23 +03:00
|
|
|
zdb_set_skip_mmp(poolname);
|
2016-12-17 01:11:29 +03:00
|
|
|
error = spa_get_stats(poolname, &cfg, NULL, 0);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal("Tried to read config of pool \"%s\" but "
|
|
|
|
"spa_get_stats() failed with error %d\n",
|
|
|
|
poolname, error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (asprintf(&bogus_name, "%s%s", poolname, BOGUS_SUFFIX) == -1)
|
|
|
|
return (NULL);
|
|
|
|
fnvlist_add_string(cfg, ZPOOL_CONFIG_POOL_NAME, bogus_name);
|
|
|
|
|
|
|
|
error = spa_import(bogus_name, cfg, NULL,
|
2018-08-20 20:05:23 +03:00
|
|
|
ZFS_IMPORT_MISSING_LOG | ZFS_IMPORT_CHECKPOINT |
|
|
|
|
ZFS_IMPORT_SKIP_MMP);
|
2016-12-17 01:11:29 +03:00
|
|
|
if (error != 0) {
|
|
|
|
fatal("Tried to import pool \"%s\" but spa_import() failed "
|
|
|
|
"with error %d\n", bogus_name, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (new_path != NULL && path_start != NULL) {
|
|
|
|
if (asprintf(new_path, "%s%s", bogus_name, path_start) == -1) {
|
|
|
|
if (path_start != NULL)
|
|
|
|
free(poolname);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (target != poolname)
|
|
|
|
free(poolname);
|
|
|
|
|
|
|
|
return (bogus_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
typedef struct verify_checkpoint_sm_entry_cb_arg {
|
|
|
|
vdev_t *vcsec_vd;
|
|
|
|
|
|
|
|
/* the following fields are only used for printing progress */
|
|
|
|
uint64_t vcsec_entryid;
|
|
|
|
uint64_t vcsec_num_entries;
|
|
|
|
} verify_checkpoint_sm_entry_cb_arg_t;
|
|
|
|
|
|
|
|
#define ENTRIES_PER_PROGRESS_UPDATE 10000
|
|
|
|
|
|
|
|
static int
|
2017-08-04 19:30:49 +03:00
|
|
|
verify_checkpoint_sm_entry_cb(space_map_entry_t *sme, void *arg)
|
2016-12-17 01:11:29 +03:00
|
|
|
{
|
|
|
|
verify_checkpoint_sm_entry_cb_arg_t *vcsec = arg;
|
|
|
|
vdev_t *vd = vcsec->vcsec_vd;
|
2017-08-04 19:30:49 +03:00
|
|
|
metaslab_t *ms = vd->vdev_ms[sme->sme_offset >> vd->vdev_ms_shift];
|
|
|
|
uint64_t end = sme->sme_offset + sme->sme_run;
|
2016-12-17 01:11:29 +03:00
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
ASSERT(sme->sme_type == SM_FREE);
|
2016-12-17 01:11:29 +03:00
|
|
|
|
|
|
|
if ((vcsec->vcsec_entryid % ENTRIES_PER_PROGRESS_UPDATE) == 0) {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\rverifying vdev %llu, space map entry %llu of %llu ...",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)vcsec->vcsec_entryid,
|
|
|
|
(longlong_t)vcsec->vcsec_num_entries);
|
|
|
|
}
|
|
|
|
vcsec->vcsec_entryid++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* See comment in checkpoint_sm_exclude_entry_cb()
|
|
|
|
*/
|
2017-08-04 19:30:49 +03:00
|
|
|
VERIFY3U(sme->sme_offset, >=, ms->ms_start);
|
2016-12-17 01:11:29 +03:00
|
|
|
VERIFY3U(end, <=, ms->ms_start + ms->ms_size);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The entries in the vdev_checkpoint_sm should be marked as
|
|
|
|
* allocated in the checkpointed state of the pool, therefore
|
|
|
|
* their respective ms_allocateable trees should not contain them.
|
|
|
|
*/
|
|
|
|
mutex_enter(&ms->ms_lock);
|
2019-01-25 20:51:24 +03:00
|
|
|
range_tree_verify_not_present(ms->ms_allocatable,
|
|
|
|
sme->sme_offset, sme->sme_run);
|
2016-12-17 01:11:29 +03:00
|
|
|
mutex_exit(&ms->ms_lock);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that all segments in the vdev_checkpoint_sm are allocated
|
|
|
|
* according to the checkpoint's ms_sm (i.e. are not in the checkpoint's
|
|
|
|
* ms_allocatable).
|
|
|
|
*
|
|
|
|
* Do so by comparing the checkpoint space maps (vdev_checkpoint_sm) of
|
|
|
|
* each vdev in the current state of the pool to the metaslab space maps
|
|
|
|
* (ms_sm) of the checkpointed state of the pool.
|
|
|
|
*
|
|
|
|
* Note that the function changes the state of the ms_allocatable
|
|
|
|
* trees of the current spa_t. The entries of these ms_allocatable
|
|
|
|
* trees are cleared out and then repopulated from with the free
|
|
|
|
* entries of their respective ms_sm space maps.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
verify_checkpoint_vdev_spacemaps(spa_t *checkpoint, spa_t *current)
|
|
|
|
{
|
|
|
|
vdev_t *ckpoint_rvd = checkpoint->spa_root_vdev;
|
|
|
|
vdev_t *current_rvd = current->spa_root_vdev;
|
|
|
|
|
|
|
|
load_concrete_ms_allocatable_trees(checkpoint, SM_FREE);
|
|
|
|
|
|
|
|
for (uint64_t c = 0; c < ckpoint_rvd->vdev_children; c++) {
|
|
|
|
vdev_t *ckpoint_vd = ckpoint_rvd->vdev_child[c];
|
|
|
|
vdev_t *current_vd = current_rvd->vdev_child[c];
|
|
|
|
|
|
|
|
space_map_t *checkpoint_sm = NULL;
|
|
|
|
uint64_t checkpoint_sm_obj;
|
|
|
|
|
|
|
|
if (ckpoint_vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
/*
|
|
|
|
* Since we don't allow device removal in a pool
|
|
|
|
* that has a checkpoint, we expect that all removed
|
|
|
|
* vdevs were removed from the pool before the
|
|
|
|
* checkpoint.
|
|
|
|
*/
|
|
|
|
ASSERT3P(current_vd->vdev_ops, ==, &vdev_indirect_ops);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the checkpoint space map doesn't exist, then nothing
|
|
|
|
* here is checkpointed so there's nothing to verify.
|
|
|
|
*/
|
|
|
|
if (current_vd->vdev_top_zap == 0 ||
|
|
|
|
zap_contains(spa_meta_objset(current),
|
|
|
|
current_vd->vdev_top_zap,
|
|
|
|
VDEV_TOP_ZAP_POOL_CHECKPOINT_SM) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
VERIFY0(zap_lookup(spa_meta_objset(current),
|
|
|
|
current_vd->vdev_top_zap, VDEV_TOP_ZAP_POOL_CHECKPOINT_SM,
|
|
|
|
sizeof (uint64_t), 1, &checkpoint_sm_obj));
|
|
|
|
|
|
|
|
VERIFY0(space_map_open(&checkpoint_sm, spa_meta_objset(current),
|
|
|
|
checkpoint_sm_obj, 0, current_vd->vdev_asize,
|
|
|
|
current_vd->vdev_ashift));
|
|
|
|
|
|
|
|
verify_checkpoint_sm_entry_cb_arg_t vcsec;
|
|
|
|
vcsec.vcsec_vd = ckpoint_vd;
|
|
|
|
vcsec.vcsec_entryid = 0;
|
|
|
|
vcsec.vcsec_num_entries =
|
|
|
|
space_map_length(checkpoint_sm) / sizeof (uint64_t);
|
|
|
|
VERIFY0(space_map_iterate(checkpoint_sm,
|
2019-02-12 21:38:11 +03:00
|
|
|
space_map_length(checkpoint_sm),
|
2016-12-17 01:11:29 +03:00
|
|
|
verify_checkpoint_sm_entry_cb, &vcsec));
|
2018-07-11 07:23:17 +03:00
|
|
|
if (dump_opt['m'] > 3)
|
|
|
|
dump_spacemap(current->spa_meta_objset, checkpoint_sm);
|
2016-12-17 01:11:29 +03:00
|
|
|
space_map_close(checkpoint_sm);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've added vdevs since we took the checkpoint, ensure
|
|
|
|
* that their checkpoint space maps are empty.
|
|
|
|
*/
|
|
|
|
if (ckpoint_rvd->vdev_children < current_rvd->vdev_children) {
|
|
|
|
for (uint64_t c = ckpoint_rvd->vdev_children;
|
|
|
|
c < current_rvd->vdev_children; c++) {
|
|
|
|
vdev_t *current_vd = current_rvd->vdev_child[c];
|
|
|
|
ASSERT3P(current_vd->vdev_checkpoint_sm, ==, NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* for cleaner progress output */
|
|
|
|
(void) fprintf(stderr, "\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verifies that all space that's allocated in the checkpoint is
|
|
|
|
* still allocated in the current version, by checking that everything
|
|
|
|
* in checkpoint's ms_allocatable (which is actually allocated, not
|
|
|
|
* allocatable/free) is not present in current's ms_allocatable.
|
|
|
|
*
|
|
|
|
* Note that the function changes the state of the ms_allocatable
|
|
|
|
* trees of both spas when called. The entries of all ms_allocatable
|
|
|
|
* trees are cleared out and then repopulated from their respective
|
|
|
|
* ms_sm space maps. In the checkpointed state we load the allocated
|
|
|
|
* entries, and in the current state we load the free entries.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
verify_checkpoint_ms_spacemaps(spa_t *checkpoint, spa_t *current)
|
|
|
|
{
|
|
|
|
vdev_t *ckpoint_rvd = checkpoint->spa_root_vdev;
|
|
|
|
vdev_t *current_rvd = current->spa_root_vdev;
|
|
|
|
|
|
|
|
load_concrete_ms_allocatable_trees(checkpoint, SM_ALLOC);
|
|
|
|
load_concrete_ms_allocatable_trees(current, SM_FREE);
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < ckpoint_rvd->vdev_children; i++) {
|
|
|
|
vdev_t *ckpoint_vd = ckpoint_rvd->vdev_child[i];
|
|
|
|
vdev_t *current_vd = current_rvd->vdev_child[i];
|
|
|
|
|
|
|
|
if (ckpoint_vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
/*
|
|
|
|
* See comment in verify_checkpoint_vdev_spacemaps()
|
|
|
|
*/
|
|
|
|
ASSERT3P(current_vd->vdev_ops, ==, &vdev_indirect_ops);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (uint64_t m = 0; m < ckpoint_vd->vdev_ms_count; m++) {
|
|
|
|
metaslab_t *ckpoint_msp = ckpoint_vd->vdev_ms[m];
|
|
|
|
metaslab_t *current_msp = current_vd->vdev_ms[m];
|
|
|
|
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\rverifying vdev %llu of %llu, "
|
|
|
|
"metaslab %llu of %llu ...",
|
|
|
|
(longlong_t)current_vd->vdev_id,
|
|
|
|
(longlong_t)current_rvd->vdev_children,
|
|
|
|
(longlong_t)current_vd->vdev_ms[m]->ms_id,
|
|
|
|
(longlong_t)current_vd->vdev_ms_count);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We walk through the ms_allocatable trees that
|
|
|
|
* are loaded with the allocated blocks from the
|
|
|
|
* ms_sm spacemaps of the checkpoint. For each
|
|
|
|
* one of these ranges we ensure that none of them
|
|
|
|
* exists in the ms_allocatable trees of the
|
|
|
|
* current state which are loaded with the ranges
|
|
|
|
* that are currently free.
|
|
|
|
*
|
|
|
|
* This way we ensure that none of the blocks that
|
|
|
|
* are part of the checkpoint were freed by mistake.
|
|
|
|
*/
|
|
|
|
range_tree_walk(ckpoint_msp->ms_allocatable,
|
2019-01-25 20:51:24 +03:00
|
|
|
(range_tree_func_t *)range_tree_verify_not_present,
|
2016-12-17 01:11:29 +03:00
|
|
|
current_msp->ms_allocatable);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* for cleaner progress output */
|
|
|
|
(void) fprintf(stderr, "\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
verify_checkpoint_blocks(spa_t *spa)
|
|
|
|
{
|
2019-01-30 20:54:27 +03:00
|
|
|
ASSERT(!dump_opt['L']);
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
spa_t *checkpoint_spa;
|
|
|
|
char *checkpoint_pool;
|
|
|
|
nvlist_t *config = NULL;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We import the checkpointed state of the pool (under a different
|
|
|
|
* name) so we can do verification on it against the current state
|
|
|
|
* of the pool.
|
|
|
|
*/
|
|
|
|
checkpoint_pool = import_checkpointed_state(spa->spa_name, config,
|
|
|
|
NULL);
|
|
|
|
ASSERT(strcmp(spa->spa_name, checkpoint_pool) != 0);
|
|
|
|
|
|
|
|
error = spa_open(checkpoint_pool, &checkpoint_spa, FTAG);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal("Tried to open pool \"%s\" but spa_open() failed with "
|
|
|
|
"error %d\n", checkpoint_pool, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that ranges in the checkpoint space maps of each vdev
|
|
|
|
* are allocated according to the checkpointed state's metaslab
|
|
|
|
* space maps.
|
|
|
|
*/
|
|
|
|
verify_checkpoint_vdev_spacemaps(checkpoint_spa, spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that allocated ranges in the checkpoint's metaslab
|
|
|
|
* space maps remain allocated in the metaslab space maps of
|
|
|
|
* the current state.
|
|
|
|
*/
|
|
|
|
verify_checkpoint_ms_spacemaps(checkpoint_spa, spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Once we are done, we get rid of the checkpointed state.
|
|
|
|
*/
|
|
|
|
spa_close(checkpoint_spa, FTAG);
|
|
|
|
free(checkpoint_pool);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_leftover_checkpoint_blocks(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < rvd->vdev_children; i++) {
|
|
|
|
vdev_t *vd = rvd->vdev_child[i];
|
|
|
|
|
|
|
|
space_map_t *checkpoint_sm = NULL;
|
|
|
|
uint64_t checkpoint_sm_obj;
|
|
|
|
|
|
|
|
if (vd->vdev_top_zap == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zap_contains(spa_meta_objset(spa), vd->vdev_top_zap,
|
|
|
|
VDEV_TOP_ZAP_POOL_CHECKPOINT_SM) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
VERIFY0(zap_lookup(spa_meta_objset(spa), vd->vdev_top_zap,
|
|
|
|
VDEV_TOP_ZAP_POOL_CHECKPOINT_SM,
|
|
|
|
sizeof (uint64_t), 1, &checkpoint_sm_obj));
|
|
|
|
|
|
|
|
VERIFY0(space_map_open(&checkpoint_sm, spa_meta_objset(spa),
|
|
|
|
checkpoint_sm_obj, 0, vd->vdev_asize, vd->vdev_ashift));
|
|
|
|
dump_spacemap(spa->spa_meta_objset, checkpoint_sm);
|
|
|
|
space_map_close(checkpoint_sm);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
verify_checkpoint(spa_t *spa)
|
|
|
|
{
|
|
|
|
uberblock_t checkpoint;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error = zap_lookup(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
|
|
|
|
DMU_POOL_ZPOOL_CHECKPOINT, sizeof (uint64_t),
|
|
|
|
sizeof (uberblock_t) / sizeof (uint64_t), &checkpoint);
|
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
if (error == ENOENT && !dump_opt['L']) {
|
2016-12-17 01:11:29 +03:00
|
|
|
/*
|
|
|
|
* If the feature is active but the uberblock is missing
|
|
|
|
* then we must be in the middle of discarding the
|
|
|
|
* checkpoint.
|
|
|
|
*/
|
|
|
|
(void) printf("\nPartially discarded checkpoint "
|
|
|
|
"state found:\n");
|
2018-07-11 07:23:17 +03:00
|
|
|
if (dump_opt['m'] > 3)
|
|
|
|
dump_leftover_checkpoint_blocks(spa);
|
2016-12-17 01:11:29 +03:00
|
|
|
return (0);
|
|
|
|
} else if (error != 0) {
|
|
|
|
(void) printf("lookup error %d when looking for "
|
|
|
|
"checkpointed uberblock in MOS\n", error);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
dump_uberblock(&checkpoint, "\nCheckpointed uberblock found:\n", "\n");
|
|
|
|
|
|
|
|
if (checkpoint.ub_checkpoint_txg == 0) {
|
|
|
|
(void) printf("\nub_checkpoint_txg not set in checkpointed "
|
|
|
|
"uberblock\n");
|
|
|
|
error = 3;
|
|
|
|
}
|
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
if (error == 0 && !dump_opt['L'])
|
2016-12-17 01:11:29 +03:00
|
|
|
verify_checkpoint_blocks(spa);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
mos_leaks_cb(void *arg, uint64_t start, uint64_t size)
|
|
|
|
{
|
|
|
|
for (uint64_t i = start; i < size; i++) {
|
|
|
|
(void) printf("MOS object %llu referenced but not allocated\n",
|
|
|
|
(u_longlong_t)i);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
mos_obj_refd(uint64_t obj)
|
|
|
|
{
|
|
|
|
if (obj != 0 && mos_refd_objs != NULL)
|
|
|
|
range_tree_add(mos_refd_objs, obj, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Call on a MOS object that may already have been referenced.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
mos_obj_refd_multiple(uint64_t obj)
|
|
|
|
{
|
|
|
|
if (obj != 0 && mos_refd_objs != NULL &&
|
|
|
|
!range_tree_contains(mos_refd_objs, obj, 1))
|
|
|
|
range_tree_add(mos_refd_objs, obj, 1);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static void
|
|
|
|
mos_leak_vdev_top_zap(vdev_t *vd)
|
|
|
|
{
|
|
|
|
uint64_t ms_flush_data_obj;
|
|
|
|
int error = zap_lookup(spa_meta_objset(vd->vdev_spa),
|
|
|
|
vd->vdev_top_zap, VDEV_TOP_ZAP_MS_UNFLUSHED_PHYS_TXGS,
|
|
|
|
sizeof (ms_flush_data_obj), 1, &ms_flush_data_obj);
|
|
|
|
if (error == ENOENT)
|
|
|
|
return;
|
|
|
|
ASSERT0(error);
|
|
|
|
|
|
|
|
mos_obj_refd(ms_flush_data_obj);
|
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
static void
|
|
|
|
mos_leak_vdev(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mos_obj_refd(vd->vdev_dtl_object);
|
|
|
|
mos_obj_refd(vd->vdev_ms_array);
|
|
|
|
mos_obj_refd(vd->vdev_indirect_config.vic_births_object);
|
|
|
|
mos_obj_refd(vd->vdev_indirect_config.vic_mapping_object);
|
|
|
|
mos_obj_refd(vd->vdev_leaf_zap);
|
|
|
|
if (vd->vdev_checkpoint_sm != NULL)
|
|
|
|
mos_obj_refd(vd->vdev_checkpoint_sm->sm_object);
|
|
|
|
if (vd->vdev_indirect_mapping != NULL) {
|
|
|
|
mos_obj_refd(vd->vdev_indirect_mapping->
|
|
|
|
vim_phys->vimp_counts_object);
|
|
|
|
}
|
|
|
|
if (vd->vdev_obsolete_sm != NULL)
|
|
|
|
mos_obj_refd(vd->vdev_obsolete_sm->sm_object);
|
|
|
|
|
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
|
|
|
metaslab_t *ms = vd->vdev_ms[m];
|
|
|
|
mos_obj_refd(space_map_object(ms->ms_sm));
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (vd->vdev_top_zap != 0) {
|
|
|
|
mos_obj_refd(vd->vdev_top_zap);
|
|
|
|
mos_leak_vdev_top_zap(vd);
|
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
for (uint64_t c = 0; c < vd->vdev_children; c++) {
|
|
|
|
mos_leak_vdev(vd->vdev_child[c]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
static void
|
|
|
|
mos_leak_log_spacemaps(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint64_t spacemap_zap;
|
|
|
|
int error = zap_lookup(spa_meta_objset(spa),
|
|
|
|
DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_LOG_SPACEMAP_ZAP,
|
|
|
|
sizeof (spacemap_zap), 1, &spacemap_zap);
|
|
|
|
if (error == ENOENT)
|
|
|
|
return;
|
|
|
|
ASSERT0(error);
|
|
|
|
|
|
|
|
mos_obj_refd(spacemap_zap);
|
|
|
|
for (spa_log_sm_t *sls = avl_first(&spa->spa_sm_logs_by_txg);
|
|
|
|
sls; sls = AVL_NEXT(&spa->spa_sm_logs_by_txg, sls))
|
|
|
|
mos_obj_refd(sls->sls_sm_obj);
|
|
|
|
}
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
static int
|
|
|
|
dump_mos_leaks(spa_t *spa)
|
|
|
|
{
|
|
|
|
int rv = 0;
|
|
|
|
objset_t *mos = spa->spa_meta_objset;
|
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
|
|
|
|
|
|
|
/* Visit and mark all referenced objects in the MOS */
|
|
|
|
|
|
|
|
mos_obj_refd(DMU_POOL_DIRECTORY_OBJECT);
|
|
|
|
mos_obj_refd(spa->spa_pool_props_object);
|
|
|
|
mos_obj_refd(spa->spa_config_object);
|
|
|
|
mos_obj_refd(spa->spa_ddt_stat_object);
|
|
|
|
mos_obj_refd(spa->spa_feat_desc_obj);
|
|
|
|
mos_obj_refd(spa->spa_feat_enabled_txg_obj);
|
|
|
|
mos_obj_refd(spa->spa_feat_for_read_obj);
|
|
|
|
mos_obj_refd(spa->spa_feat_for_write_obj);
|
|
|
|
mos_obj_refd(spa->spa_history);
|
|
|
|
mos_obj_refd(spa->spa_errlog_last);
|
|
|
|
mos_obj_refd(spa->spa_errlog_scrub);
|
|
|
|
mos_obj_refd(spa->spa_all_vdev_zaps);
|
|
|
|
mos_obj_refd(spa->spa_dsl_pool->dp_bptree_obj);
|
|
|
|
mos_obj_refd(spa->spa_dsl_pool->dp_tmp_userrefs_obj);
|
|
|
|
mos_obj_refd(spa->spa_dsl_pool->dp_scan->scn_phys.scn_queue_obj);
|
|
|
|
bpobj_count_refd(&spa->spa_deferred_bpobj);
|
|
|
|
mos_obj_refd(dp->dp_empty_bpobj);
|
|
|
|
bpobj_count_refd(&dp->dp_obsolete_bpobj);
|
|
|
|
bpobj_count_refd(&dp->dp_free_bpobj);
|
|
|
|
mos_obj_refd(spa->spa_l2cache.sav_object);
|
|
|
|
mos_obj_refd(spa->spa_spares.sav_object);
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (spa->spa_syncing_log_sm != NULL)
|
|
|
|
mos_obj_refd(spa->spa_syncing_log_sm->sm_object);
|
|
|
|
mos_leak_log_spacemaps(spa);
|
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
mos_obj_refd(spa->spa_condensing_indirect_phys.
|
|
|
|
scip_next_mapping_object);
|
|
|
|
mos_obj_refd(spa->spa_condensing_indirect_phys.
|
|
|
|
scip_prev_obsolete_sm_object);
|
|
|
|
if (spa->spa_condensing_indirect_phys.scip_next_mapping_object != 0) {
|
|
|
|
vdev_indirect_mapping_t *vim =
|
|
|
|
vdev_indirect_mapping_open(mos,
|
|
|
|
spa->spa_condensing_indirect_phys.scip_next_mapping_object);
|
|
|
|
mos_obj_refd(vim->vim_phys->vimp_counts_object);
|
|
|
|
vdev_indirect_mapping_close(vim);
|
|
|
|
}
|
2019-07-26 20:54:14 +03:00
|
|
|
deleted_livelists_dump_mos(spa);
|
2018-10-12 21:28:26 +03:00
|
|
|
|
|
|
|
if (dp->dp_origin_snap != NULL) {
|
|
|
|
dsl_dataset_t *ds;
|
|
|
|
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
|
|
|
VERIFY0(dsl_dataset_hold_obj(dp,
|
|
|
|
dsl_dataset_phys(dp->dp_origin_snap)->ds_next_snap_obj,
|
|
|
|
FTAG, &ds));
|
|
|
|
count_ds_mos_objects(ds);
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_blkptr_list(&ds->ds_deadlist, "Deadlist");
|
2018-10-12 21:28:26 +03:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
|
|
|
|
count_ds_mos_objects(dp->dp_origin_snap);
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_blkptr_list(&dp->dp_origin_snap->ds_deadlist, "Deadlist");
|
2018-10-12 21:28:26 +03:00
|
|
|
}
|
|
|
|
count_dir_mos_objects(dp->dp_mos_dir);
|
|
|
|
if (dp->dp_free_dir != NULL)
|
|
|
|
count_dir_mos_objects(dp->dp_free_dir);
|
|
|
|
if (dp->dp_leak_dir != NULL)
|
|
|
|
count_dir_mos_objects(dp->dp_leak_dir);
|
|
|
|
|
|
|
|
mos_leak_vdev(spa->spa_root_vdev);
|
|
|
|
|
|
|
|
for (uint64_t class = 0; class < DDT_CLASSES; class++) {
|
|
|
|
for (uint64_t type = 0; type < DDT_TYPES; type++) {
|
|
|
|
for (uint64_t cksum = 0;
|
|
|
|
cksum < ZIO_CHECKSUM_FUNCTIONS; cksum++) {
|
|
|
|
ddt_t *ddt = spa->spa_ddt[cksum];
|
|
|
|
mos_obj_refd(ddt->ddt_object[type][class]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Visit all allocated objects and make sure they are referenced.
|
|
|
|
*/
|
|
|
|
uint64_t object = 0;
|
|
|
|
while (dmu_object_next(mos, &object, B_FALSE, 0) == 0) {
|
|
|
|
if (range_tree_contains(mos_refd_objs, object, 1)) {
|
|
|
|
range_tree_remove(mos_refd_objs, object, 1);
|
|
|
|
} else {
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
const char *name;
|
|
|
|
dmu_object_info(mos, object, &doi);
|
|
|
|
if (doi.doi_type & DMU_OT_NEWTYPE) {
|
|
|
|
dmu_object_byteswap_t bswap =
|
|
|
|
DMU_OT_BYTESWAP(doi.doi_type);
|
|
|
|
name = dmu_ot_byteswap[bswap].ob_name;
|
|
|
|
} else {
|
|
|
|
name = dmu_ot[doi.doi_type].ot_name;
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("MOS object %llu (%s) leaked\n",
|
|
|
|
(u_longlong_t)object, name);
|
|
|
|
rv = 2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
(void) range_tree_walk(mos_refd_objs, mos_leaks_cb, NULL);
|
|
|
|
if (!range_tree_is_empty(mos_refd_objs))
|
|
|
|
rv = 2;
|
|
|
|
range_tree_vacate(mos_refd_objs, NULL, NULL);
|
|
|
|
range_tree_destroy(mos_refd_objs);
|
|
|
|
return (rv);
|
|
|
|
}
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
typedef struct log_sm_obsolete_stats_arg {
|
|
|
|
uint64_t lsos_current_txg;
|
|
|
|
|
|
|
|
uint64_t lsos_total_entries;
|
|
|
|
uint64_t lsos_valid_entries;
|
|
|
|
|
|
|
|
uint64_t lsos_sm_entries;
|
|
|
|
uint64_t lsos_valid_sm_entries;
|
|
|
|
} log_sm_obsolete_stats_arg_t;
|
|
|
|
|
|
|
|
static int
|
|
|
|
log_spacemap_obsolete_stats_cb(spa_t *spa, space_map_entry_t *sme,
|
|
|
|
uint64_t txg, void *arg)
|
|
|
|
{
|
|
|
|
log_sm_obsolete_stats_arg_t *lsos = arg;
|
|
|
|
|
|
|
|
uint64_t offset = sme->sme_offset;
|
|
|
|
uint64_t vdev_id = sme->sme_vdev;
|
|
|
|
|
|
|
|
if (lsos->lsos_current_txg == 0) {
|
|
|
|
/* this is the first log */
|
|
|
|
lsos->lsos_current_txg = txg;
|
|
|
|
} else if (lsos->lsos_current_txg < txg) {
|
|
|
|
/* we just changed log - print stats and reset */
|
|
|
|
(void) printf("%-8llu valid entries out of %-8llu - txg %llu\n",
|
|
|
|
(u_longlong_t)lsos->lsos_valid_sm_entries,
|
|
|
|
(u_longlong_t)lsos->lsos_sm_entries,
|
|
|
|
(u_longlong_t)lsos->lsos_current_txg);
|
|
|
|
lsos->lsos_valid_sm_entries = 0;
|
|
|
|
lsos->lsos_sm_entries = 0;
|
|
|
|
lsos->lsos_current_txg = txg;
|
|
|
|
}
|
|
|
|
ASSERT3U(lsos->lsos_current_txg, ==, txg);
|
|
|
|
|
|
|
|
lsos->lsos_sm_entries++;
|
|
|
|
lsos->lsos_total_entries++;
|
|
|
|
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, vdev_id);
|
|
|
|
if (!vdev_is_concrete(vd))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
metaslab_t *ms = vd->vdev_ms[offset >> vd->vdev_ms_shift];
|
|
|
|
ASSERT(sme->sme_type == SM_ALLOC || sme->sme_type == SM_FREE);
|
|
|
|
|
|
|
|
if (txg < metaslab_unflushed_txg(ms))
|
|
|
|
return (0);
|
|
|
|
lsos->lsos_valid_sm_entries++;
|
|
|
|
lsos->lsos_valid_entries++;
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_log_spacemap_obsolete_stats(spa_t *spa)
|
|
|
|
{
|
2019-07-18 22:54:03 +03:00
|
|
|
if (!spa_feature_is_active(spa, SPA_FEATURE_LOG_SPACEMAP))
|
|
|
|
return;
|
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
log_sm_obsolete_stats_arg_t lsos;
|
|
|
|
bzero(&lsos, sizeof (lsos));
|
|
|
|
|
|
|
|
(void) printf("Log Space Map Obsolete Entry Statistics:\n");
|
|
|
|
|
|
|
|
iterate_through_spacemap_logs(spa,
|
|
|
|
log_spacemap_obsolete_stats_cb, &lsos);
|
|
|
|
|
|
|
|
/* print stats for latest log */
|
|
|
|
(void) printf("%-8llu valid entries out of %-8llu - txg %llu\n",
|
|
|
|
(u_longlong_t)lsos.lsos_valid_sm_entries,
|
|
|
|
(u_longlong_t)lsos.lsos_sm_entries,
|
|
|
|
(u_longlong_t)lsos.lsos_current_txg);
|
|
|
|
|
|
|
|
(void) printf("%-8llu valid entries out of %-8llu - total\n\n",
|
|
|
|
(u_longlong_t)lsos.lsos_valid_entries,
|
|
|
|
(u_longlong_t)lsos.lsos_total_entries);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_zpool(spa_t *spa)
|
|
|
|
{
|
|
|
|
dsl_pool_t *dp = spa_get_dsl(spa);
|
|
|
|
int rc = 0;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['S']) {
|
|
|
|
dump_simulated_ddt(spa);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!dump_opt['e'] && dump_opt['C'] > 1) {
|
|
|
|
(void) printf("\nCached configuration:\n");
|
|
|
|
dump_nvlist(spa->spa_config, 8);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dump_opt['C'])
|
|
|
|
dump_config(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['u'])
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uberblock(&spa->spa_uberblock, "\nUberblock:\n", "\n");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['D'])
|
|
|
|
dump_all_ddts(spa);
|
|
|
|
|
|
|
|
if (dump_opt['d'] > 2 || dump_opt['m'])
|
|
|
|
dump_metaslabs(spa);
|
2014-07-20 00:19:24 +04:00
|
|
|
if (dump_opt['M'])
|
|
|
|
dump_metaslab_groups(spa);
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (dump_opt['d'] > 2 || dump_opt['m']) {
|
|
|
|
dump_log_spacemaps(spa);
|
|
|
|
dump_log_spacemap_obsolete_stats(spa);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (dump_opt['d'] || dump_opt['i']) {
|
2015-07-24 19:53:55 +03:00
|
|
|
spa_feature_t f;
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
mos_refd_objs = range_tree_create(NULL, RANGE_SEG64, NULL, 0,
|
|
|
|
0);
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_objset(dp->dp_meta_objset);
|
2018-10-12 21:28:26 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['d'] >= 3) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&spa->spa_deferred_bpobj,
|
2013-07-05 23:37:16 +04:00
|
|
|
"Deferred frees", 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
dump_full_bpobj(&dp->dp_free_bpobj,
|
2013-07-05 23:37:16 +04:00
|
|
|
"Pool snapshot frees", 0);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
if (bpobj_is_open(&dp->dp_obsolete_bpobj)) {
|
|
|
|
ASSERT(spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_DEVICE_REMOVAL));
|
|
|
|
dump_full_bpobj(&dp->dp_obsolete_bpobj,
|
|
|
|
"Pool obsolete blocks", 0);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (spa_feature_is_active(spa,
|
2013-10-08 21:13:05 +04:00
|
|
|
SPA_FEATURE_ASYNC_DESTROY)) {
|
2012-12-14 03:24:15 +04:00
|
|
|
dump_bptree(spa->spa_meta_objset,
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
dp->dp_bptree_obj,
|
2012-12-14 03:24:15 +04:00
|
|
|
"Pool dataset frees");
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_dtl(spa->spa_root_vdev, 0);
|
|
|
|
}
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
|
|
|
|
for (spa_feature_t f = 0; f < SPA_FEATURES; f++)
|
|
|
|
global_feature_count[f] = UINT64_MAX;
|
|
|
|
global_feature_count[SPA_FEATURE_REDACTION_BOOKMARKS] = 0;
|
|
|
|
global_feature_count[SPA_FEATURE_BOOKMARK_WRITTEN] = 0;
|
2019-07-26 20:54:14 +03:00
|
|
|
global_feature_count[SPA_FEATURE_LIVELIST] = 0;
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
|
2019-07-26 20:54:14 +03:00
|
|
|
(void) dmu_objset_find(spa_name(spa), dump_one_objset,
|
2010-05-29 00:45:14 +04:00
|
|
|
NULL, DS_FIND_SNAPSHOTS | DS_FIND_CHILDREN);
|
2014-11-03 23:15:08 +03:00
|
|
|
|
2018-10-12 21:28:26 +03:00
|
|
|
if (rc == 0 && !dump_opt['L'])
|
|
|
|
rc = dump_mos_leaks(spa);
|
|
|
|
|
2015-07-24 19:53:55 +03:00
|
|
|
for (f = 0; f < SPA_FEATURES; f++) {
|
|
|
|
uint64_t refcount;
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
uint64_t *arr;
|
2015-07-24 19:53:55 +03:00
|
|
|
if (!(spa_feature_table[f].fi_flags &
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
ZFEATURE_FLAG_PER_DATASET)) {
|
|
|
|
if (global_feature_count[f] == UINT64_MAX)
|
|
|
|
continue;
|
|
|
|
if (!spa_feature_is_enabled(spa, f)) {
|
|
|
|
ASSERT0(global_feature_count[f]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
arr = global_feature_count;
|
|
|
|
} else {
|
|
|
|
if (!spa_feature_is_enabled(spa, f)) {
|
|
|
|
ASSERT0(dataset_feature_count[f]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
arr = dataset_feature_count;
|
2015-07-24 19:53:55 +03:00
|
|
|
}
|
|
|
|
if (feature_get_refcount(spa, &spa_feature_table[f],
|
|
|
|
&refcount) == ENOTSUP)
|
|
|
|
continue;
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (arr[f] != refcount) {
|
2015-07-24 19:53:55 +03:00
|
|
|
(void) printf("%s feature refcount mismatch: "
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
"%lld consumers != %lld refcount\n",
|
2015-07-24 19:53:55 +03:00
|
|
|
spa_feature_table[f].fi_uname,
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
(longlong_t)arr[f], (longlong_t)refcount);
|
2015-06-25 07:05:32 +03:00
|
|
|
rc = 2;
|
|
|
|
} else {
|
2015-07-24 19:53:55 +03:00
|
|
|
(void) printf("Verified %s feature refcount "
|
|
|
|
"of %llu is correct\n",
|
|
|
|
spa_feature_table[f].fi_uname,
|
2015-06-25 07:05:32 +03:00
|
|
|
(longlong_t)refcount);
|
|
|
|
}
|
2014-11-03 23:15:08 +03:00
|
|
|
}
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
if (rc == 0)
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
rc = verify_device_removal_feature_counts(spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2018-10-12 21:28:26 +03:00
|
|
|
|
2014-11-03 23:15:08 +03:00
|
|
|
if (rc == 0 && (dump_opt['b'] || dump_opt['c']))
|
2008-11-20 23:01:55 +03:00
|
|
|
rc = dump_block_stats(spa);
|
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
if (rc == 0)
|
|
|
|
rc = verify_spacemap_refcounts(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['s'])
|
|
|
|
show_pool_stats(spa);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['h'])
|
|
|
|
dump_history(spa);
|
|
|
|
|
2017-08-04 19:30:49 +03:00
|
|
|
if (rc == 0)
|
2016-12-17 01:11:29 +03:00
|
|
|
rc = verify_checkpoint(spa);
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
if (rc != 0) {
|
|
|
|
dump_debug_buffer();
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(rc);
|
2017-01-28 23:16:43 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#define ZDB_FLAG_CHECKSUM 0x0001
|
|
|
|
#define ZDB_FLAG_DECOMPRESS 0x0002
|
|
|
|
#define ZDB_FLAG_BSWAP 0x0004
|
|
|
|
#define ZDB_FLAG_GBH 0x0008
|
|
|
|
#define ZDB_FLAG_INDIRECT 0x0010
|
|
|
|
#define ZDB_FLAG_PHYS 0x0020
|
|
|
|
#define ZDB_FLAG_RAW 0x0040
|
|
|
|
#define ZDB_FLAG_PRINT_BLKPTR 0x0080
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static int flagbits[256];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_print_blkptr(blkptr_t *bp, int flags)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (flags & ZDB_FLAG_BSWAP)
|
|
|
|
byteswap_uint64_array((void *)bp, sizeof (blkptr_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%s\n", blkbuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_indirect(blkptr_t *bp, int nbps, int flags)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < nbps; i++)
|
|
|
|
zdb_print_blkptr(&bp[i], flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_gbh(void *buf, int flags)
|
|
|
|
{
|
|
|
|
zdb_dump_indirect((blkptr_t *)buf, SPA_GBH_NBLKPTRS, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_block_raw(void *buf, uint64_t size, int flags)
|
|
|
|
{
|
|
|
|
if (flags & ZDB_FLAG_BSWAP)
|
|
|
|
byteswap_uint64_array(buf, size);
|
2010-08-26 20:52:40 +04:00
|
|
|
VERIFY(write(fileno(stdout), buf, size) == size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_block(char *label, void *buf, uint64_t size, int flags)
|
|
|
|
{
|
|
|
|
uint64_t *d = (uint64_t *)buf;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned nwords = size / sizeof (uint64_t);
|
2008-11-20 23:01:55 +03:00
|
|
|
int do_bswap = !!(flags & ZDB_FLAG_BSWAP);
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i, j;
|
|
|
|
const char *hdr;
|
|
|
|
char *c;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
|
|
|
|
if (do_bswap)
|
|
|
|
hdr = " 7 6 5 4 3 2 1 0 f e d c b a 9 8";
|
|
|
|
else
|
|
|
|
hdr = " 0 1 2 3 4 5 6 7 8 9 a b c d e f";
|
|
|
|
|
|
|
|
(void) printf("\n%s\n%6s %s 0123456789abcdef\n", label, "", hdr);
|
|
|
|
|
2015-11-21 02:47:37 +03:00
|
|
|
#ifdef _LITTLE_ENDIAN
|
2017-01-03 20:31:18 +03:00
|
|
|
/* correct the endianness */
|
2015-11-21 02:47:37 +03:00
|
|
|
do_bswap = !do_bswap;
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < nwords; i += 2) {
|
|
|
|
(void) printf("%06llx: %016llx %016llx ",
|
|
|
|
(u_longlong_t)(i * sizeof (uint64_t)),
|
|
|
|
(u_longlong_t)(do_bswap ? BSWAP_64(d[i]) : d[i]),
|
|
|
|
(u_longlong_t)(do_bswap ? BSWAP_64(d[i + 1]) : d[i + 1]));
|
|
|
|
|
|
|
|
c = (char *)&d[i];
|
|
|
|
for (j = 0; j < 2 * sizeof (uint64_t); j++)
|
|
|
|
(void) printf("%c", isprint(c[j]) ? c[j] : '.');
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There are two acceptable formats:
|
|
|
|
* leaf_name - For example: c1t0d0 or /tmp/ztest.0a
|
|
|
|
* child[.child]* - For example: 0.1.1
|
|
|
|
*
|
|
|
|
* The second form can be used to specify arbitrary vdevs anywhere
|
2017-01-03 20:31:18 +03:00
|
|
|
* in the hierarchy. For example, in a pool with a mirror of
|
2008-11-20 23:01:55 +03:00
|
|
|
* RAID-Zs, you can specify either RAID-Z vdev with 0.0 or 0.1 .
|
|
|
|
*/
|
|
|
|
static vdev_t *
|
2017-10-27 22:46:35 +03:00
|
|
|
zdb_vdev_lookup(vdev_t *vdev, const char *path)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
char *s, *p, *q;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (vdev == NULL)
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
/* First, assume the x.x.x.x format */
|
2017-10-27 22:46:35 +03:00
|
|
|
i = strtoul(path, &s, 10);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (s == path || (s && *s != '.' && *s != '\0'))
|
|
|
|
goto name;
|
2017-10-27 22:46:35 +03:00
|
|
|
if (i >= vdev->vdev_children)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
vdev = vdev->vdev_child[i];
|
2016-10-24 23:37:38 +03:00
|
|
|
if (s && *s == '\0')
|
2008-11-20 23:01:55 +03:00
|
|
|
return (vdev);
|
|
|
|
return (zdb_vdev_lookup(vdev, s+1));
|
|
|
|
|
|
|
|
name:
|
|
|
|
for (i = 0; i < vdev->vdev_children; i++) {
|
|
|
|
vdev_t *vc = vdev->vdev_child[i];
|
|
|
|
|
|
|
|
if (vc->vdev_path == NULL) {
|
|
|
|
vc = zdb_vdev_lookup(vc, path);
|
|
|
|
if (vc == NULL)
|
|
|
|
continue;
|
|
|
|
else
|
|
|
|
return (vc);
|
|
|
|
}
|
|
|
|
|
|
|
|
p = strrchr(vc->vdev_path, '/');
|
|
|
|
p = p ? p + 1 : vc->vdev_path;
|
|
|
|
q = &vc->vdev_path[strlen(vc->vdev_path) - 2];
|
|
|
|
|
|
|
|
if (strcmp(vc->vdev_path, path) == 0)
|
|
|
|
return (vc);
|
|
|
|
if (strcmp(p, path) == 0)
|
|
|
|
return (vc);
|
|
|
|
if (strcmp(q, "s0") == 0 && strncmp(p, path, q - p) == 0)
|
|
|
|
return (vc);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read a block from a pool and print it out. The syntax of the
|
|
|
|
* block descriptor is:
|
|
|
|
*
|
|
|
|
* pool:vdev_specifier:offset:size[:flags]
|
|
|
|
*
|
|
|
|
* pool - The name of the pool you wish to read from
|
|
|
|
* vdev_specifier - Which vdev (see comment for zdb_vdev_lookup)
|
|
|
|
* offset - offset, in hex, in bytes
|
|
|
|
* size - Amount of data to read, in hex, in bytes
|
|
|
|
* flags - A string of characters specifying options
|
|
|
|
* b: Decode a blkptr at given offset within block
|
|
|
|
* *c: Calculate and display checksums
|
2010-05-29 00:45:14 +04:00
|
|
|
* d: Decompress data before dumping
|
2008-11-20 23:01:55 +03:00
|
|
|
* e: Byteswap data before dumping
|
2010-05-29 00:45:14 +04:00
|
|
|
* g: Display data as a gang block header
|
|
|
|
* i: Display as an indirect block
|
2008-11-20 23:01:55 +03:00
|
|
|
* p: Do I/O to physical offset
|
|
|
|
* r: Dump raw data to stdout
|
|
|
|
*
|
|
|
|
* * = not yet implemented
|
|
|
|
*/
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_read_block(char *thing, spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
blkptr_t blk, *bp = &blk;
|
|
|
|
dva_t *dva = bp->blk_dva;
|
2008-11-20 23:01:55 +03:00
|
|
|
int flags = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = 0, size = 0, psize = 0, lsize = 0, blkptr_offset = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *zio;
|
|
|
|
vdev_t *vd;
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *pabd;
|
|
|
|
void *lbuf, *buf;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *s, *vdev;
|
|
|
|
char *p, *dup, *flagstr;
|
2010-05-29 00:45:14 +04:00
|
|
|
int i, error;
|
2017-01-05 22:10:07 +03:00
|
|
|
boolean_t borrowed = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dup = strdup(thing);
|
|
|
|
s = strtok(dup, ":");
|
|
|
|
vdev = s ? s : "";
|
|
|
|
s = strtok(NULL, ":");
|
|
|
|
offset = strtoull(s ? s : "", NULL, 16);
|
|
|
|
s = strtok(NULL, ":");
|
|
|
|
size = strtoull(s ? s : "", NULL, 16);
|
|
|
|
s = strtok(NULL, ":");
|
2017-10-27 22:46:35 +03:00
|
|
|
if (s)
|
|
|
|
flagstr = strdup(s);
|
|
|
|
else
|
|
|
|
flagstr = strdup("");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
s = NULL;
|
|
|
|
if (size == 0)
|
|
|
|
s = "size must not be zero";
|
|
|
|
if (!IS_P2ALIGNED(size, DEV_BSIZE))
|
|
|
|
s = "size must be a multiple of sector size";
|
|
|
|
if (!IS_P2ALIGNED(offset, DEV_BSIZE))
|
|
|
|
s = "offset must be a multiple of sector size";
|
|
|
|
if (s) {
|
|
|
|
(void) printf("Invalid block specifier: %s - %s\n", thing, s);
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (s = strtok(flagstr, ":"); s; s = strtok(NULL, ":")) {
|
|
|
|
for (i = 0; flagstr[i]; i++) {
|
|
|
|
int bit = flagbits[(uchar_t)flagstr[i]];
|
|
|
|
|
|
|
|
if (bit == 0) {
|
|
|
|
(void) printf("***Invalid flag: %c\n",
|
|
|
|
flagstr[i]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
flags |= bit;
|
|
|
|
|
|
|
|
/* If it's not something with an argument, keep going */
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((bit & (ZDB_FLAG_CHECKSUM |
|
2008-11-20 23:01:55 +03:00
|
|
|
ZDB_FLAG_PRINT_BLKPTR)) == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
p = &flagstr[i + 1];
|
2016-02-03 19:07:34 +03:00
|
|
|
if (bit == ZDB_FLAG_PRINT_BLKPTR) {
|
2008-11-20 23:01:55 +03:00
|
|
|
blkptr_offset = strtoull(p, &p, 16);
|
2016-02-03 19:07:34 +03:00
|
|
|
i = p - &flagstr[i + 1];
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
if (*p != ':' && *p != '\0') {
|
|
|
|
(void) printf("***Invalid flag arg: '%s'\n", s);
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
vd = zdb_vdev_lookup(spa->spa_root_vdev, vdev);
|
|
|
|
if (vd == NULL) {
|
|
|
|
(void) printf("***Invalid vdev: %s\n", vdev);
|
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
if (vd->vdev_path)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, "Found vdev: %s\n",
|
|
|
|
vd->vdev_path);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, "Found vdev type: %s\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
vd->vdev_ops->vdev_op_type);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
psize = size;
|
|
|
|
lsize = size;
|
|
|
|
|
2017-01-05 22:10:07 +03:00
|
|
|
pabd = abd_alloc_for_io(SPA_MAXBLOCKSIZE, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
lbuf = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
BP_ZERO(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
DVA_SET_VDEV(&dva[0], vd->vdev_id);
|
|
|
|
DVA_SET_OFFSET(&dva[0], offset);
|
|
|
|
DVA_SET_GANG(&dva[0], !!(flags & ZDB_FLAG_GBH));
|
|
|
|
DVA_SET_ASIZE(&dva[0], vdev_psize_to_asize(vd, psize));
|
|
|
|
|
|
|
|
BP_SET_BIRTH(bp, TXG_INITIAL, TXG_INITIAL);
|
|
|
|
|
|
|
|
BP_SET_LSIZE(bp, lsize);
|
|
|
|
BP_SET_PSIZE(bp, psize);
|
|
|
|
BP_SET_COMPRESS(bp, ZIO_COMPRESS_OFF);
|
|
|
|
BP_SET_CHECKSUM(bp, ZIO_CHECKSUM_OFF);
|
|
|
|
BP_SET_TYPE(bp, DMU_OT_NONE);
|
|
|
|
BP_SET_LEVEL(bp, 0);
|
|
|
|
BP_SET_DEDUP(bp, 0);
|
|
|
|
BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
zio = zio_root(spa, NULL, NULL, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (vd == vd->vdev_top) {
|
|
|
|
/*
|
|
|
|
* Treat this as a normal block read.
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_read(zio, spa, bp, pabd, psize, NULL, NULL,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZIO_PRIORITY_SYNC_READ,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_RAW, NULL));
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Treat this as a vdev child I/O.
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_vdev_child_io(zio, bp, vd, offset, pabd,
|
|
|
|
psize, ZIO_TYPE_READ, ZIO_PRIORITY_SYNC_READ,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_QUEUE |
|
|
|
|
ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_RAW | ZIO_FLAG_OPTIONAL,
|
|
|
|
NULL, NULL));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
error = zio_wait(zio);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error) {
|
|
|
|
(void) printf("Read of %s failed, error: %d\n", thing, error);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (flags & ZDB_FLAG_DECOMPRESS) {
|
|
|
|
/*
|
|
|
|
* We don't know how the data was compressed, so just try
|
|
|
|
* every decompress function at every inflated blocksize.
|
|
|
|
*/
|
|
|
|
enum zio_compress c;
|
|
|
|
void *lbuf2 = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
|
|
|
|
|
2016-06-29 23:59:51 +03:00
|
|
|
/*
|
|
|
|
* XXX - On the one hand, with SPA_MAXBLOCKSIZE at 16MB,
|
|
|
|
* this could take a while and we should let the user know
|
|
|
|
* we are not stuck. On the other hand, printing progress
|
|
|
|
* info gets old after a while. What to do?
|
|
|
|
*/
|
|
|
|
for (lsize = psize + SPA_MINBLOCKSIZE;
|
|
|
|
lsize <= SPA_MAXBLOCKSIZE; lsize += SPA_MINBLOCKSIZE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
for (c = 0; c < ZIO_COMPRESS_FUNCTIONS; c++) {
|
2018-02-02 03:19:36 +03:00
|
|
|
/*
|
|
|
|
* ZLE can easily decompress non zle stream.
|
|
|
|
* So have an option to disable it.
|
|
|
|
*/
|
|
|
|
if (c == ZIO_COMPRESS_ZLE &&
|
|
|
|
getenv("ZDB_NO_ZLE"))
|
|
|
|
continue;
|
|
|
|
|
2016-08-11 02:28:58 +03:00
|
|
|
(void) fprintf(stderr,
|
|
|
|
"Trying %05llx -> %05llx (%s)\n",
|
2016-06-29 23:59:51 +03:00
|
|
|
(u_longlong_t)psize, (u_longlong_t)lsize,
|
|
|
|
zio_compress_table[c].ci_name);
|
2018-02-02 03:19:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We randomize lbuf2, and decompress to both
|
|
|
|
* lbuf and lbuf2. This way, we will know if
|
|
|
|
* decompression fill exactly to lsize.
|
|
|
|
*/
|
|
|
|
VERIFY0(random_get_pseudo_bytes(lbuf2, lsize));
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
if (zio_decompress_data(c, pabd,
|
|
|
|
lbuf, psize, lsize) == 0 &&
|
2018-02-02 03:19:36 +03:00
|
|
|
zio_decompress_data(c, pabd,
|
2016-07-22 18:52:49 +03:00
|
|
|
lbuf2, psize, lsize) == 0 &&
|
2010-05-29 00:45:14 +04:00
|
|
|
bcmp(lbuf, lbuf2, lsize) == 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (c != ZIO_COMPRESS_FUNCTIONS)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
umem_free(lbuf2, SPA_MAXBLOCKSIZE);
|
|
|
|
|
2018-02-02 03:19:36 +03:00
|
|
|
if (lsize > SPA_MAXBLOCKSIZE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Decompress of %s failed\n", thing);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
buf = lbuf;
|
|
|
|
size = lsize;
|
|
|
|
} else {
|
|
|
|
size = psize;
|
2017-01-05 22:10:07 +03:00
|
|
|
buf = abd_borrow_buf_copy(pabd, size);
|
|
|
|
borrowed = B_TRUE;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (flags & ZDB_FLAG_PRINT_BLKPTR)
|
|
|
|
zdb_print_blkptr((blkptr_t *)(void *)
|
|
|
|
((uintptr_t)buf + (uintptr_t)blkptr_offset), flags);
|
|
|
|
else if (flags & ZDB_FLAG_RAW)
|
|
|
|
zdb_dump_block_raw(buf, size, flags);
|
|
|
|
else if (flags & ZDB_FLAG_INDIRECT)
|
|
|
|
zdb_dump_indirect((blkptr_t *)buf, size / sizeof (blkptr_t),
|
|
|
|
flags);
|
|
|
|
else if (flags & ZDB_FLAG_GBH)
|
|
|
|
zdb_dump_gbh(buf, flags);
|
|
|
|
else
|
|
|
|
zdb_dump_block(thing, buf, size, flags);
|
|
|
|
|
2017-01-05 22:10:07 +03:00
|
|
|
if (borrowed)
|
|
|
|
abd_return_buf_copy(pabd, buf, size);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
out:
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_free(pabd);
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(lbuf, SPA_MAXBLOCKSIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
}
|
|
|
|
|
2017-05-01 21:06:07 +03:00
|
|
|
static void
|
|
|
|
zdb_embedded_block(char *thing)
|
|
|
|
{
|
|
|
|
blkptr_t bp;
|
|
|
|
unsigned long long *words = (void *)&bp;
|
2018-02-02 03:28:11 +03:00
|
|
|
char *buf;
|
2017-05-01 21:06:07 +03:00
|
|
|
int err;
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&bp, sizeof (bp));
|
2017-05-01 21:06:07 +03:00
|
|
|
err = sscanf(thing, "%llx:%llx:%llx:%llx:%llx:%llx:%llx:%llx:"
|
|
|
|
"%llx:%llx:%llx:%llx:%llx:%llx:%llx:%llx",
|
|
|
|
words + 0, words + 1, words + 2, words + 3,
|
|
|
|
words + 4, words + 5, words + 6, words + 7,
|
|
|
|
words + 8, words + 9, words + 10, words + 11,
|
|
|
|
words + 12, words + 13, words + 14, words + 15);
|
|
|
|
if (err != 16) {
|
2018-05-07 11:35:50 +03:00
|
|
|
(void) fprintf(stderr, "invalid input format\n");
|
2017-05-01 21:06:07 +03:00
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
ASSERT3U(BPE_GET_LSIZE(&bp), <=, SPA_MAXBLOCKSIZE);
|
2018-05-07 11:35:50 +03:00
|
|
|
buf = malloc(SPA_MAXBLOCKSIZE);
|
|
|
|
if (buf == NULL) {
|
|
|
|
(void) fprintf(stderr, "out of memory\n");
|
|
|
|
exit(1);
|
|
|
|
}
|
2017-05-01 21:06:07 +03:00
|
|
|
err = decode_embedded_bp(&bp, buf, BPE_GET_LSIZE(&bp));
|
|
|
|
if (err != 0) {
|
2018-05-07 11:35:50 +03:00
|
|
|
(void) fprintf(stderr, "decode failed: %u\n", err);
|
2017-05-01 21:06:07 +03:00
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
zdb_dump_block_raw(buf, BPE_GET_LSIZE(&bp), 0);
|
2018-05-07 11:35:50 +03:00
|
|
|
free(buf);
|
2017-05-01 21:06:07 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
|
|
|
main(int argc, char **argv)
|
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
int c;
|
2008-11-20 23:01:55 +03:00
|
|
|
struct rlimit rl = { 1024, 1024 };
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *spa = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
objset_t *os = NULL;
|
|
|
|
int dump_all = 1;
|
|
|
|
int verbose = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
int error = 0;
|
|
|
|
char **searchdirs = NULL;
|
|
|
|
int nsearch = 0;
|
2018-02-02 03:36:40 +03:00
|
|
|
char *target, *target_pool;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *policy = NULL;
|
|
|
|
uint64_t max_txg = UINT64_MAX;
|
2014-06-08 22:10:14 +04:00
|
|
|
int flags = ZFS_IMPORT_MISSING_LOG;
|
2010-05-29 00:45:14 +04:00
|
|
|
int rewind = ZPOOL_NEVER_REWIND;
|
2013-06-24 10:45:20 +04:00
|
|
|
char *spa_config_path_env;
|
2015-05-14 20:45:56 +03:00
|
|
|
boolean_t target_is_spa = B_TRUE;
|
2016-12-17 01:11:29 +03:00
|
|
|
nvlist_t *cfg = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) setrlimit(RLIMIT_NOFILE, &rl);
|
|
|
|
(void) enable_extended_FILE_stdio(-1, -1);
|
|
|
|
|
|
|
|
dprintf_setup(&argc, argv);
|
|
|
|
|
2013-06-24 10:45:20 +04:00
|
|
|
/*
|
|
|
|
* If there is an environment variable SPA_CONFIG_PATH it overrides
|
|
|
|
* default spa_config_path setting. If -U flag is specified it will
|
|
|
|
* override this environment variable settings once again.
|
|
|
|
*/
|
|
|
|
spa_config_path_env = getenv("SPA_CONFIG_PATH");
|
|
|
|
if (spa_config_path_env != NULL)
|
|
|
|
spa_config_path = spa_config_path_env;
|
|
|
|
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
/*
|
|
|
|
* For performance reasons, we set this tunable down. We do so before
|
|
|
|
* the arg parsing section so that the user can override this value if
|
|
|
|
* they choose.
|
|
|
|
*/
|
|
|
|
zfs_btree_verify_intensity = 3;
|
|
|
|
|
2016-01-01 16:42:58 +03:00
|
|
|
while ((c = getopt(argc, argv,
|
2019-01-17 01:10:02 +03:00
|
|
|
"AbcCdDeEFGhiI:klLmMo:Op:PqRsSt:uU:vVx:XY")) != -1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
switch (c) {
|
|
|
|
case 'b':
|
|
|
|
case 'c':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'C':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'd':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'D':
|
2017-05-01 21:06:07 +03:00
|
|
|
case 'E':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'G':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'h':
|
|
|
|
case 'i':
|
|
|
|
case 'l':
|
2009-07-03 02:44:48 +04:00
|
|
|
case 'm':
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'M':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'O':
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'R':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 's':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'S':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'u':
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_opt[c]++;
|
|
|
|
dump_all = 0;
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'A':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'e':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2016-12-17 01:11:29 +03:00
|
|
|
case 'k':
|
2009-01-16 00:59:39 +03:00
|
|
|
case 'L':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'P':
|
2017-02-04 01:18:28 +03:00
|
|
|
case 'q':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'X':
|
2009-01-16 00:59:39 +03:00
|
|
|
dump_opt[c]++;
|
|
|
|
break;
|
2019-01-17 01:10:02 +03:00
|
|
|
case 'Y':
|
|
|
|
zfs_reconstruct_indirect_combinations_max = INT_MAX;
|
|
|
|
zfs_deadman_enabled = 0;
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
/* NB: Sort single match options below. */
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'I':
|
2019-08-13 17:11:57 +03:00
|
|
|
max_inflight_bytes = strtoull(optarg, NULL, 0);
|
|
|
|
if (max_inflight_bytes == 0) {
|
2013-05-03 03:36:32 +04:00
|
|
|
(void) fprintf(stderr, "maximum number "
|
2019-08-13 17:11:57 +03:00
|
|
|
"of inflight bytes must be greater "
|
2013-05-03 03:36:32 +04:00
|
|
|
"than 0\n");
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'o':
|
|
|
|
error = set_global_var(optarg);
|
|
|
|
if (error != 0)
|
|
|
|
usage();
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'p':
|
2010-05-29 00:45:14 +04:00
|
|
|
if (searchdirs == NULL) {
|
|
|
|
searchdirs = umem_alloc(sizeof (char *),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
} else {
|
|
|
|
char **tmp = umem_alloc((nsearch + 1) *
|
|
|
|
sizeof (char *), UMEM_NOFAIL);
|
|
|
|
bcopy(searchdirs, tmp, nsearch *
|
|
|
|
sizeof (char *));
|
|
|
|
umem_free(searchdirs,
|
|
|
|
nsearch * sizeof (char *));
|
|
|
|
searchdirs = tmp;
|
|
|
|
}
|
|
|
|
searchdirs[nsearch++] = optarg;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2009-01-16 00:59:39 +03:00
|
|
|
case 't':
|
2010-05-29 00:45:14 +04:00
|
|
|
max_txg = strtoull(optarg, NULL, 0);
|
|
|
|
if (max_txg < TXG_INITIAL) {
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) fprintf(stderr, "incorrect txg "
|
|
|
|
"specified: %s\n", optarg);
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'U':
|
|
|
|
spa_config_path = optarg;
|
2017-05-01 21:06:07 +03:00
|
|
|
if (spa_config_path[0] != '/') {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"cachefile must be an absolute path "
|
|
|
|
"(i.e. start with a slash)\n");
|
|
|
|
usage();
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'v':
|
|
|
|
verbose++;
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'V':
|
2017-04-14 00:28:46 +03:00
|
|
|
flags = ZFS_IMPORT_VERBATIM;
|
2017-04-13 19:40:56 +03:00
|
|
|
break;
|
2017-01-31 21:13:10 +03:00
|
|
|
case 'x':
|
|
|
|
vn_dumpdir = optarg;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
default:
|
|
|
|
usage();
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['e'] && searchdirs != NULL) {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) fprintf(stderr, "-p option requires use of -e\n");
|
|
|
|
usage();
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-10-24 02:26:49 +04:00
|
|
|
#if defined(_LP64)
|
2014-09-17 00:24:48 +04:00
|
|
|
/*
|
|
|
|
* ZDB does not typically re-read blocks; therefore limit the ARC
|
|
|
|
* to 256 MB, which can be used entirely for metadata.
|
|
|
|
*/
|
|
|
|
zfs_arc_max = zfs_arc_meta_limit = 256 * 1024 * 1024;
|
2014-10-24 02:26:49 +04:00
|
|
|
#endif
|
2014-09-17 00:24:48 +04:00
|
|
|
|
2015-05-15 02:41:29 +03:00
|
|
|
/*
|
|
|
|
* "zdb -c" uses checksum-verifying scrub i/os which are async reads.
|
|
|
|
* "zdb -b" uses traversal prefetch which uses async reads.
|
|
|
|
* For good performance, let several of them be active at once.
|
|
|
|
*/
|
|
|
|
zfs_vdev_async_read_max_active = 10;
|
|
|
|
|
2017-02-01 01:36:35 +03:00
|
|
|
/*
|
|
|
|
* Disable reference tracking for better performance.
|
|
|
|
*/
|
|
|
|
reference_tracking_enable = B_FALSE;
|
|
|
|
|
2018-01-31 02:25:19 +03:00
|
|
|
/*
|
|
|
|
* Do not fail spa_load when spa_load_verify fails. This is needed
|
|
|
|
* to load non-idle pools.
|
|
|
|
*/
|
|
|
|
spa_load_verify_dryrun = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
kernel_init(FREAD);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_all)
|
|
|
|
verbose = MAX(verbose, 1);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (c = 0; c < 256; c++) {
|
2016-12-17 01:11:29 +03:00
|
|
|
if (dump_all && strchr("AeEFklLOPRSX", c) == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_opt[c] = 1;
|
|
|
|
if (dump_opt[c])
|
|
|
|
dump_opt[c] += verbose;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
aok = (dump_opt['A'] == 1) || (dump_opt['A'] > 2);
|
|
|
|
zfs_recover = (dump_opt['A'] > 1);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
argc -= optind;
|
|
|
|
argv += optind;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (argc < 2 && dump_opt['R'])
|
|
|
|
usage();
|
2017-05-01 21:06:07 +03:00
|
|
|
|
|
|
|
if (dump_opt['E']) {
|
|
|
|
if (argc != 1)
|
|
|
|
usage();
|
|
|
|
zdb_embedded_block(argv[0]);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (argc < 1) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['e'] && dump_opt['C']) {
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_cachefile(spa_config_path);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
|
2017-02-04 01:18:28 +03:00
|
|
|
if (dump_opt['l'])
|
|
|
|
return (dump_label(argv[0]));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
if (dump_opt['O']) {
|
|
|
|
if (argc != 2)
|
|
|
|
usage();
|
|
|
|
dump_opt['v'] = verbose + 3;
|
|
|
|
return (dump_path(argv[0], argv[1]));
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['X'] || dump_opt['F'])
|
|
|
|
rewind = ZPOOL_DO_REWIND |
|
|
|
|
(dump_opt['X'] ? ZPOOL_EXTREME_REWIND : 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_alloc(&policy, NV_UNIQUE_NAME_TYPE, 0) != 0 ||
|
2017-02-11 01:51:09 +03:00
|
|
|
nvlist_add_uint64(policy, ZPOOL_LOAD_REQUEST_TXG, max_txg) != 0 ||
|
|
|
|
nvlist_add_uint32(policy, ZPOOL_LOAD_REWIND_POLICY, rewind) != 0)
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal("internal error: %s", strerror(ENOMEM));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
error = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
target = argv[0];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-02-02 03:36:40 +03:00
|
|
|
if (strpbrk(target, "/@") != NULL) {
|
|
|
|
size_t targetlen;
|
|
|
|
|
|
|
|
target_pool = strdup(target);
|
|
|
|
*strpbrk(target_pool, "/@") = '\0';
|
|
|
|
|
|
|
|
target_is_spa = B_FALSE;
|
|
|
|
targetlen = strlen(target);
|
|
|
|
if (targetlen && target[targetlen - 1] == '/')
|
|
|
|
target[targetlen - 1] = '\0';
|
|
|
|
} else {
|
|
|
|
target_pool = target;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['e']) {
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
importargs_t args = { 0 };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
args.paths = nsearch;
|
|
|
|
args.path = searchdirs;
|
|
|
|
args.can_be_active = B_TRUE;
|
|
|
|
|
2018-11-05 22:22:33 +03:00
|
|
|
error = zpool_find_config(NULL, target_pool, &cfg, &args,
|
|
|
|
&libzpool_config_ops);
|
2018-02-02 03:36:40 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
if (error == 0) {
|
2018-02-02 03:36:40 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_add_nvlist(cfg,
|
2017-02-11 01:51:09 +03:00
|
|
|
ZPOOL_LOAD_POLICY, policy) != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal("can't open '%s': %s",
|
|
|
|
target, strerror(ENOMEM));
|
|
|
|
}
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
if (dump_opt['C'] > 1) {
|
|
|
|
(void) printf("\nConfiguration for import:\n");
|
|
|
|
dump_nvlist(cfg, 8);
|
|
|
|
}
|
2016-12-17 01:11:29 +03:00
|
|
|
|
2018-08-20 20:05:23 +03:00
|
|
|
/*
|
|
|
|
* Disable the activity check to allow examination of
|
|
|
|
* active pools.
|
|
|
|
*/
|
2018-02-02 03:36:40 +03:00
|
|
|
error = spa_import(target_pool, cfg, NULL,
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
flags | ZFS_IMPORT_SKIP_MMP);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2018-10-30 19:46:18 +03:00
|
|
|
/*
|
|
|
|
* import_checkpointed_state makes the assumption that the
|
|
|
|
* target pool that we pass it is already part of the spa
|
|
|
|
* namespace. Because of that we need to make sure to call
|
|
|
|
* it always after the -e option has been processed, which
|
|
|
|
* imports the pool to the namespace if it's not in the
|
|
|
|
* cachefile.
|
|
|
|
*/
|
|
|
|
char *checkpoint_pool = NULL;
|
|
|
|
char *checkpoint_target = NULL;
|
|
|
|
if (dump_opt['k']) {
|
|
|
|
checkpoint_pool = import_checkpointed_state(target, cfg,
|
|
|
|
&checkpoint_target);
|
|
|
|
|
|
|
|
if (checkpoint_target != NULL)
|
|
|
|
target = checkpoint_target;
|
|
|
|
}
|
|
|
|
|
2018-02-02 03:36:40 +03:00
|
|
|
if (target_pool != target)
|
|
|
|
free(target_pool);
|
2015-05-14 20:45:56 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error == 0) {
|
2016-12-17 01:11:29 +03:00
|
|
|
if (dump_opt['k'] && (target_is_spa || dump_opt['R'])) {
|
|
|
|
ASSERT(checkpoint_pool != NULL);
|
|
|
|
ASSERT(checkpoint_target == NULL);
|
|
|
|
|
|
|
|
error = spa_open(checkpoint_pool, &spa, FTAG);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal("Tried to open pool \"%s\" but "
|
|
|
|
"spa_open() failed with error %d\n",
|
|
|
|
checkpoint_pool, error);
|
|
|
|
}
|
|
|
|
|
|
|
|
} else if (target_is_spa || dump_opt['R']) {
|
2018-08-20 20:05:23 +03:00
|
|
|
zdb_set_skip_mmp(target);
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_open_rewind(target, &spa, FTAG, policy,
|
|
|
|
NULL);
|
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* If we're missing the log device then
|
|
|
|
* try opening the pool after clearing the
|
|
|
|
* log state.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(target)) != NULL &&
|
|
|
|
spa->spa_log_state == SPA_LOG_MISSING) {
|
|
|
|
spa->spa_log_state = SPA_LOG_CLEAR;
|
|
|
|
error = 0;
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
if (!error) {
|
|
|
|
error = spa_open_rewind(target, &spa,
|
|
|
|
FTAG, policy, NULL);
|
|
|
|
}
|
|
|
|
}
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
} else if (strpbrk(target, "#") != NULL) {
|
|
|
|
dsl_pool_t *dp;
|
|
|
|
error = dsl_pool_hold(target, FTAG, &dp);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal("can't dump '%s': %s", target,
|
|
|
|
strerror(error));
|
|
|
|
}
|
|
|
|
error = dump_bookmark(dp, target, B_TRUE, verbose > 1);
|
|
|
|
dsl_pool_rele(dp, FTAG);
|
|
|
|
if (error != 0) {
|
|
|
|
fatal("can't dump '%s': %s", target,
|
|
|
|
strerror(error));
|
|
|
|
}
|
|
|
|
return (error);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2018-08-20 20:05:23 +03:00
|
|
|
zdb_set_skip_mmp(target);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
error = open_objset(target, FTAG, &os);
|
2018-01-09 03:15:23 +03:00
|
|
|
if (error == 0)
|
|
|
|
spa = dmu_objset_spa(os);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_free(policy);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error)
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal("can't open '%s': %s", target, strerror(error));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-01-09 03:15:23 +03:00
|
|
|
/*
|
|
|
|
* Set the pool failure mode to panic in order to prevent the pool
|
|
|
|
* from suspending. A suspended I/O will have no way to resume and
|
|
|
|
* can prevent the zdb(8) command from terminating as expected.
|
|
|
|
*/
|
|
|
|
if (spa != NULL)
|
|
|
|
spa->spa_failmode = ZIO_FAILURE_MODE_PANIC;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
argv++;
|
2010-05-29 00:45:14 +04:00
|
|
|
argc--;
|
|
|
|
if (!dump_opt['R']) {
|
|
|
|
if (argc > 0) {
|
|
|
|
zopt_objects = argc;
|
|
|
|
zopt_object = calloc(zopt_objects, sizeof (uint64_t));
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned i = 0; i < zopt_objects; i++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
errno = 0;
|
|
|
|
zopt_object[i] = strtoull(argv[i], NULL, 0);
|
|
|
|
if (zopt_object[i] == 0 && errno != 0)
|
|
|
|
fatal("bad number %s: %s",
|
|
|
|
argv[i], strerror(errno));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2013-01-12 04:42:50 +04:00
|
|
|
if (os != NULL) {
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_objset(os);
|
2013-01-12 04:42:50 +04:00
|
|
|
} else if (zopt_objects > 0 && !dump_opt['m']) {
|
2019-07-26 20:54:14 +03:00
|
|
|
dump_objset(spa->spa_meta_objset);
|
2013-01-12 04:42:50 +04:00
|
|
|
} else {
|
|
|
|
dump_zpool(spa);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
flagbits['b'] = ZDB_FLAG_PRINT_BLKPTR;
|
|
|
|
flagbits['c'] = ZDB_FLAG_CHECKSUM;
|
|
|
|
flagbits['d'] = ZDB_FLAG_DECOMPRESS;
|
|
|
|
flagbits['e'] = ZDB_FLAG_BSWAP;
|
|
|
|
flagbits['g'] = ZDB_FLAG_GBH;
|
|
|
|
flagbits['i'] = ZDB_FLAG_INDIRECT;
|
|
|
|
flagbits['p'] = ZDB_FLAG_PHYS;
|
|
|
|
flagbits['r'] = ZDB_FLAG_RAW;
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int i = 0; i < argc; i++)
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_read_block(argv[i], spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-12-17 01:11:29 +03:00
|
|
|
if (dump_opt['k']) {
|
|
|
|
free(checkpoint_pool);
|
|
|
|
if (!target_is_spa)
|
|
|
|
free(checkpoint_target);
|
|
|
|
}
|
|
|
|
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (os != NULL) {
|
2017-04-13 19:40:56 +03:00
|
|
|
close_objset(os, FTAG);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
} else {
|
2017-04-13 19:40:56 +03:00
|
|
|
spa_close(spa, FTAG);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
fuid_table_destroy();
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
dump_debug_buffer();
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
kernel_fini();
|
|
|
|
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|