/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
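The remapping step described in the commit message above (a read or free aimed at a removed, now "indirect", vdev is redirected to the new location) can be sketched as a lookup in a sorted in-memory table. This is only an illustration under stated assumptions: `remap_entry_t`, its fields, and `remap_lookup()` are hypothetical names invented here, not the structures the removal code actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one copied run and where it now lives. */
typedef struct remap_entry {
	uint64_t re_src_offset;	/* start offset on the removed vdev */
	uint64_t re_size;	/* length of the copied run, in bytes */
	uint64_t re_dst_vdev;	/* id of the vdev now holding the data */
	uint64_t re_dst_offset;	/* start offset on the destination vdev */
} remap_entry_t;

/*
 * Binary-search a table sorted by re_src_offset for the entry that
 * covers `offset`.  Returns 0 and fills in the new location on success,
 * or -1 if no entry covers the offset (the region was never copied).
 */
static int
remap_lookup(const remap_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const remap_entry_t *e = &tbl[mid];

		if (offset < e->re_src_offset) {
			hi = mid;
		} else if (offset - e->re_src_offset >= e->re_size) {
			lo = mid + 1;
		} else {
			*dst_vdev = e->re_dst_vdev;
			*dst_offset = e->re_dst_offset +
			    (offset - e->re_src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Keeping the table sorted by source offset makes each lookup logarithmic in the number of entries, which fits the message's point that the mapping stays in memory and adds little overhead. Dropping entries that have become obsolete shrinks the table without changing the lookup.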
 * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
 * Copyright (c) 2014 Integros [integros.com]
 * Copyright 2016 Nexenta Systems, Inc.
 * Copyright (c) 2017 Lawrence Livermore National Security, LLC.
 * Copyright (c) 2015, 2017, Intel Corporation.
 */

#include <stdio.h>
#include <unistd.h>
#include <stdio_ext.h>
#include <stdlib.h>
#include <ctype.h>
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/dmu.h>
#include <sys/zap.h>
#include <sys/fs/zfs.h>
#include <sys/zfs_znode.h>
#include <sys/zfs_sa.h>
#include <sys/sa.h>
#include <sys/sa_impl.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/metaslab_impl.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_pool.h>
#include <sys/dbuf.h>
#include <sys/zil.h>
#include <sys/zil_impl.h>
#include <sys/stat.h>
#include <sys/resource.h>
#include <sys/dmu_traverse.h>
#include <sys/zio_checksum.h>
#include <sys/zio_compress.h>
#include <sys/zfs_fuid.h>
#include <sys/arc.h>
#include <sys/ddt.h>
#include <sys/zfeature.h>
#include <sys/abd.h>
#include <sys/blkptr.h>
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
#include <sys/dsl_crypt.h>
#include <zfs_comutil.h>
#include <libzfs.h>

#include "zdb.h"

#define ZDB_COMPRESS_NAME(idx) ((idx) < ZIO_COMPRESS_FUNCTIONS ? \
	zio_compress_table[(idx)].ci_name : "UNKNOWN")
#define ZDB_CHECKSUM_NAME(idx) ((idx) < ZIO_CHECKSUM_FUNCTIONS ? \
	zio_checksum_table[(idx)].ci_name : "UNKNOWN")
#define ZDB_OT_TYPE(idx) ((idx) < DMU_OT_NUMTYPES ? (idx) : \
	(idx) == DMU_OTN_ZAP_DATA || (idx) == DMU_OTN_ZAP_METADATA ? \
	DMU_OT_ZAP_OTHER : \
	(idx) == DMU_OTN_UINT64_DATA || (idx) == DMU_OTN_UINT64_METADATA ? \
	DMU_OT_UINT64_OTHER : DMU_OT_NUMTYPES)

static char *
zdb_ot_name(dmu_object_type_t type)
{
	if (type < DMU_OT_NUMTYPES)
		return (dmu_ot[type].ot_name);
	else if ((type & DMU_OT_NEWTYPE) &&
	    ((type & DMU_OT_BYTESWAP_MASK) < DMU_BSWAP_NUMFUNCS))
		return (dmu_ot_byteswap[type & DMU_OT_BYTESWAP_MASK].ob_name);
	else
		return ("UNKNOWN");
}

extern int reference_tracking_enable;
extern int zfs_recover;
extern uint64_t zfs_arc_max, zfs_arc_meta_limit;
extern int zfs_vdev_async_read_max_active;
extern boolean_t spa_load_verify_dryrun;

static const char cmdname[] = "zdb";
uint8_t dump_opt[256];

typedef void object_viewer_t(objset_t *, uint64_t, void *data, size_t size);

uint64_t *zopt_object = NULL;
static unsigned zopt_objects = 0;
libzfs_handle_t *g_zfs;
uint64_t max_inflight = 1000;
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this has happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there are few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and try to figure out the path of the object by following its parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, its parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken its place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change its parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
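The second idea in the commit message above (check the delete queue before resolving a path) can be sketched as follows. This is a minimal illustration, assuming the delete queue has already been read into a plain array of object numbers. `sketch_on_deleteq()` and `sketch_object_label()` are hypothetical helpers written for this example and are not part of zdb or libzfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Linear scan; adequate when the delete queue holds only a few objects. */
static bool
sketch_on_deleteq(const uint64_t *deleteq, size_t n, uint64_t object)
{
	for (size_t i = 0; i < n; i++) {
		if (deleteq[i] == object)
			return (true);
	}
	return (false);
}

/*
 * Write a display label for `object` into buf.  Objects on the delete
 * queue are marked as such and `path` is ignored, since their parent
 * may be gone or its object number may have been reused.
 */
static void
sketch_object_label(const uint64_t *deleteq, size_t n, uint64_t object,
    const char *path, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);
	if (sketch_on_deleteq(deleteq, n, object)) {
		(void) snprintf(buf, buflen, "%llu (on delete queue)",
		    (unsigned long long)object);
	} else {
		(void) snprintf(buf, buflen, "%llu %s",
		    (unsigned long long)object, path);
	}
}
```

The linear scan mirrors the trade-off stated in the message: it is not the most efficient method, but it is simple and sufficient while the queue stays short.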
static int leaked_objects = 0;

static void snprintf_blkptr_compact(char *, size_t, const blkptr_t *);

/*
 * These libumem hooks provide a reasonable set of defaults for the allocator's
 * debugging facilities.
 */
const char *
_umem_debug_init(void)
{
	return ("default,verbose"); /* $UMEM_DEBUG setting */
}

const char *
_umem_logging_init(void)
{
	return ("fail,contents"); /* $UMEM_LOGGING setting */
}

static void
usage(void)
{
	(void) fprintf(stderr,
	    "Usage:\t%s [-AbcdDFGhiLMPsvX] [-e [-V] [-p <path> ...]] "
	    "[-I <inflight I/Os>]\n"
	    "\t\t[-o <var>=<value>]... [-t <txg>] [-U <cache>] [-x <dumpdir>]\n"
	    "\t\t[<poolname> [<object> ...]]\n"
	    "\t%s [-AdiPv] [-e [-V] [-p <path> ...]] [-U <cache>] <dataset>\n"
	    "\t\t[<object> ...]\n"
	    "\t%s -C [-A] [-U <cache>]\n"
	    "\t%s -l [-Aqu] <device>\n"
	    "\t%s -m [-AFLPX] [-e [-V] [-p <path> ...]] [-t <txg>] "
	    "[-U <cache>]\n\t\t<poolname> [<vdev> [<metaslab> ...]]\n"
	    "\t%s -O <dataset> <path>\n"
	    "\t%s -R [-A] [-e [-V] [-p <path> ...]] [-U <cache>]\n"
	    "\t\t<poolname> <vdev>:<offset>:<size>[:<flags>]\n"
	    "\t%s -E [-A] word0:word1:...:word15\n"
	    "\t%s -S [-AP] [-e [-V] [-p <path> ...]] [-U <cache>] "
	    "<poolname>\n\n",
	    cmdname, cmdname, cmdname, cmdname, cmdname, cmdname, cmdname,
	    cmdname, cmdname);

	(void) fprintf(stderr, " Dataset name must include at least one "
	    "separator character '/' or '@'\n");
	(void) fprintf(stderr, " If dataset name is specified, only that "
	    "dataset is dumped\n");
	(void) fprintf(stderr, " If object numbers are specified, only "
	    "those objects are dumped\n\n");
	(void) fprintf(stderr, " Options to control amount of output:\n");
	(void) fprintf(stderr, " -b block statistics\n");
	(void) fprintf(stderr, " -c checksum all metadata (twice for "
	    "all data) blocks\n");
	(void) fprintf(stderr, " -C config (or cachefile if alone)\n");
	(void) fprintf(stderr, " -d dataset(s)\n");
	(void) fprintf(stderr, " -D dedup statistics\n");
	(void) fprintf(stderr, " -E decode and display block from an "
	    "embedded block pointer\n");
	(void) fprintf(stderr, " -h pool history\n");
	(void) fprintf(stderr, " -i intent logs\n");
	(void) fprintf(stderr, " -l read label contents\n");
	(void) fprintf(stderr, " -L disable leak tracking (do not "
	    "load spacemaps)\n");
	(void) fprintf(stderr, " -m metaslabs\n");
	(void) fprintf(stderr, " -M metaslab groups\n");
	(void) fprintf(stderr, " -O perform object lookups by path\n");
	(void) fprintf(stderr, " -R read and display block from a "
	    "device\n");
	(void) fprintf(stderr, " -s report stats on zdb's I/O\n");
	(void) fprintf(stderr, " -S simulate dedup to measure effect\n");
	(void) fprintf(stderr, " -v verbose (applies to all "
	    "others)\n\n");
	(void) fprintf(stderr, " Below options are intended for use "
	    "with other options:\n");
	(void) fprintf(stderr, " -A ignore assertions (-A), enable "
	    "panic recovery (-AA) or both (-AAA)\n");
	(void) fprintf(stderr, " -e pool is exported/destroyed/"
	    "has altroot/not in a cachefile\n");
	(void) fprintf(stderr, " -F attempt automatic rewind within "
	    "safe range of transaction groups\n");
	(void) fprintf(stderr, " -G dump zfs_dbgmsg buffer before "
	    "exiting\n");
	(void) fprintf(stderr, " -I <number of inflight I/Os> -- "
	    "specify the maximum number of\n "
	    "checksumming I/Os [default is 1000]\n");
	(void) fprintf(stderr, " -o <variable>=<value> set global "
	    "variable to an unsigned 32-bit integer\n");
	(void) fprintf(stderr, " -p <path> -- use one or more with "
	    "-e to specify path to vdev dir\n");
	(void) fprintf(stderr, " -P print numbers in parseable form\n");
	(void) fprintf(stderr, " -q don't print label contents\n");
	(void) fprintf(stderr, " -t <txg> -- highest txg to use when "
	    "searching for uberblocks\n");
	(void) fprintf(stderr, " -u uberblock\n");
	(void) fprintf(stderr, " -U <cachefile_path> -- use alternate "
	    "cachefile\n");
	(void) fprintf(stderr, " -V do verbatim import\n");
	(void) fprintf(stderr, " -x <dumpdir> -- "
	    "dump all read blocks into specified directory\n");
	(void) fprintf(stderr, " -X attempt extreme rewind (does not "
	    "work with dataset)\n");
	(void) fprintf(stderr, "Specify an option more than once (e.g. -bb) "
	    "to make only that option verbose\n");
	(void) fprintf(stderr, "Default is to dump everything non-verbosely\n");
	exit(1);
}

static void
dump_debug_buffer(void)
{
	if (dump_opt['G']) {
		(void) printf("\n");
		zfs_dbgmsg_print("zdb");
	}
}

/*
 * Called for usage errors that are discovered after a call to spa_open(),
 * dmu_bonus_hold(), or pool_match(). abort() is called for other errors.
 */

static void
fatal(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	(void) fprintf(stderr, "%s: ", cmdname);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);
	(void) fprintf(stderr, "\n");

	dump_debug_buffer();

	exit(1);
}

/* ARGSUSED */
static void
dump_packed_nvlist(objset_t *os, uint64_t object, void *data, size_t size)
{
	nvlist_t *nv;
	size_t nvsize = *(uint64_t *)data;
	char *packed = umem_alloc(nvsize, UMEM_NOFAIL);

	VERIFY(0 == dmu_read(os, object, 0, nvsize, packed, DMU_READ_PREFETCH));

	VERIFY(nvlist_unpack(packed, nvsize, &nv, 0) == 0);

	umem_free(packed, nvsize);

	dump_nvlist(nv, 8);

	nvlist_free(nv);
}

/* ARGSUSED */
static void
dump_history_offsets(objset_t *os, uint64_t object, void *data, size_t size)
{
	spa_history_phys_t *shp = data;

	if (shp == NULL)
		return;

	(void) printf("\t\tpool_create_len = %llu\n",
	    (u_longlong_t)shp->sh_pool_create_len);
	(void) printf("\t\tphys_max_off = %llu\n",
	    (u_longlong_t)shp->sh_phys_max_off);
	(void) printf("\t\tbof = %llu\n",
	    (u_longlong_t)shp->sh_bof);
	(void) printf("\t\teof = %llu\n",
	    (u_longlong_t)shp->sh_eof);
	(void) printf("\t\trecords_lost = %llu\n",
	    (u_longlong_t)shp->sh_records_lost);
}

static void
zdb_nicenum(uint64_t num, char *buf, size_t buflen)
{
	if (dump_opt['P'])
		(void) snprintf(buf, buflen, "%llu", (u_longlong_t)num);
	else
		nicenum(num, buf, buflen);
}

static const char histo_stars[] = "****************************************";
static const uint64_t histo_width = sizeof (histo_stars) - 1;

static void
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
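The "additional information about the contiguous space" mentioned in the commit message above is, per the feature flag name, a histogram. One plausible way to build such a histogram is to bucket each free segment by the power of two of its size. This is only a sketch under that assumption: `SKETCH_HISTO_BUCKETS`, `sketch_log2()`, `sketch_histo_add()`, and the `shift` parameter are hypothetical names for this example and do not describe the on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_HISTO_BUCKETS	32	/* hypothetical bucket count */

/* Index of the highest set bit of a nonzero value, i.e. floor(log2(v)). */
static int
sketch_log2(uint64_t v)
{
	int idx = 0;

	assert(v != 0);
	while ((v >>= 1) != 0)
		idx++;
	return (idx);
}

/*
 * Count one free segment of `size` bytes.  Bucket i holds segments with
 * 2^(i + shift) <= size < 2^(i + shift + 1).  Segments smaller than
 * 2^shift are clamped into the first bucket and oversized segments into
 * the last one.
 */
static void
sketch_histo_add(uint64_t *histo, uint64_t size, int shift)
{
	int idx = sketch_log2(size) - shift;

	if (idx < 0)
		idx = 0;
	if (idx >= SKETCH_HISTO_BUCKETS)
		idx = SKETCH_HISTO_BUCKETS - 1;
	histo[idx]++;
}
```

With a summary like this, an allocator can tell a space map with many small fragments from one with a few large runs, even when both report the same total free space. That is the distinction the message says the old free-space heuristic could not make.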
dump_histogram(const uint64_t *histo, int size, int offset)
{
	int i;
	int minidx = size - 1;
	int maxidx = 0;
	uint64_t max = 0;

	for (i = 0; i < size; i++) {
		if (histo[i] > max)
			max = histo[i];
		if (histo[i] > 0 && i > maxidx)
			maxidx = i;
		if (histo[i] > 0 && i < minidx)
			minidx = i;
	}

	if (max < histo_width)
		max = histo_width;

	for (i = minidx; i <= maxidx; i++) {
		(void) printf("\t\t\t%3u: %6llu %s\n",
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
i + offset, (u_longlong_t)histo[i],
|
2013-03-25 01:24:51 +04:00
|
|
|
&histo_stars[(max - histo[i]) * histo_width / max]);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
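The bar scaling used by dump_histogram() can be sketched in isolation. This is a minimal illustration, not the zdb code itself: `stars`, `stars_width`, `bar_offset()` and `print_histogram()` are hypothetical stand-ins, and the definitions of `histo_stars` and `histo_width` are assumed to look roughly like the two constants below. Each bar is printed by pointing into a fixed string of asterisks, so a larger count means a smaller starting offset and a longer bar.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed stand-ins for zdb's histo_stars and histo_width. */
static const char stars[] = "****************************************";
static const uint64_t stars_width = sizeof (stars) - 1;

/*
 * Return the offset into stars[] at which the bar for `count` starts.
 * The bar length is stars_width minus that offset.
 */
static uint64_t
bar_offset(uint64_t count, uint64_t max)
{
	/* Small histograms are not stretched: at most one star per count. */
	if (max < stars_width)
		max = stars_width;
	assert(count <= max);
	return ((max - count) * stars_width / max);
}

/* Print one row per bucket between the first and last non-empty bucket. */
static void
print_histogram(const uint64_t *histo, int size)
{
	int minidx = size - 1;
	int maxidx = 0;
	uint64_t max = 0;

	for (int i = 0; i < size; i++) {
		if (histo[i] > max)
			max = histo[i];
		if (histo[i] > 0 && i > maxidx)
			maxidx = i;
		if (histo[i] > 0 && i < minidx)
			minidx = i;
	}

	for (int i = minidx; i <= maxidx; i++) {
		(void) printf("%3d: %6llu %s\n", i,
		    (unsigned long long)histo[i],
		    &stars[bar_offset(histo[i], max)]);
	}
}
```

Calling `print_histogram()` on an array whose largest bucket exceeds the string width gives that bucket a full-width bar and scales the others proportionally; empty leading and trailing buckets are skipped.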
static void
dump_zap_stats(objset_t *os, uint64_t object)
{
	int error;
	zap_stats_t zs;

	error = zap_get_stats(os, object, &zs);
	if (error)
		return;

	if (zs.zs_ptrtbl_len == 0) {
		ASSERT(zs.zs_num_blocks == 1);
		(void) printf("\tmicrozap: %llu bytes, %llu entries\n",
		    (u_longlong_t)zs.zs_blocksize,
		    (u_longlong_t)zs.zs_num_entries);
		return;
	}

	(void) printf("\tFat ZAP stats:\n");

	(void) printf("\t\tPointer table:\n");
	(void) printf("\t\t\t%llu elements\n",
	    (u_longlong_t)zs.zs_ptrtbl_len);
	(void) printf("\t\t\tzt_blk: %llu\n",
	    (u_longlong_t)zs.zs_ptrtbl_zt_blk);
	(void) printf("\t\t\tzt_numblks: %llu\n",
	    (u_longlong_t)zs.zs_ptrtbl_zt_numblks);
	(void) printf("\t\t\tzt_shift: %llu\n",
	    (u_longlong_t)zs.zs_ptrtbl_zt_shift);
	(void) printf("\t\t\tzt_blks_copied: %llu\n",
	    (u_longlong_t)zs.zs_ptrtbl_blks_copied);
	(void) printf("\t\t\tzt_nextblk: %llu\n",
	    (u_longlong_t)zs.zs_ptrtbl_nextblk);

	(void) printf("\t\tZAP entries: %llu\n",
	    (u_longlong_t)zs.zs_num_entries);
	(void) printf("\t\tLeaf blocks: %llu\n",
	    (u_longlong_t)zs.zs_num_leafs);
	(void) printf("\t\tTotal blocks: %llu\n",
	    (u_longlong_t)zs.zs_num_blocks);
	(void) printf("\t\tzap_block_type: 0x%llx\n",
	    (u_longlong_t)zs.zs_block_type);
	(void) printf("\t\tzap_magic: 0x%llx\n",
	    (u_longlong_t)zs.zs_magic);
	(void) printf("\t\tzap_salt: 0x%llx\n",
	    (u_longlong_t)zs.zs_salt);

	(void) printf("\t\tLeafs with 2^n pointers:\n");
	dump_histogram(zs.zs_leafs_with_2n_pointers, ZAP_HISTOGRAM_SIZE, 0);

	(void) printf("\t\tBlocks with n*5 entries:\n");
	dump_histogram(zs.zs_blocks_with_n5_entries, ZAP_HISTOGRAM_SIZE, 0);

	(void) printf("\t\tBlocks n/10 full:\n");
	dump_histogram(zs.zs_blocks_n_tenths_full, ZAP_HISTOGRAM_SIZE, 0);

	(void) printf("\t\tEntries with n chunks:\n");
	dump_histogram(zs.zs_entries_using_n_chunks, ZAP_HISTOGRAM_SIZE, 0);

	(void) printf("\t\tBuckets with n entries:\n");
	dump_histogram(zs.zs_buckets_with_n_entries, ZAP_HISTOGRAM_SIZE, 0);
}

/*ARGSUSED*/
static void
dump_none(objset_t *os, uint64_t object, void *data, size_t size)
{
}

/*ARGSUSED*/
static void
dump_unknown(objset_t *os, uint64_t object, void *data, size_t size)
{
	(void) printf("\tUNKNOWN OBJECT TYPE\n");
}

/*ARGSUSED*/
static void
dump_uint8(objset_t *os, uint64_t object, void *data, size_t size)
{
}

/*ARGSUSED*/
static void
dump_uint64(objset_t *os, uint64_t object, void *data, size_t size)
{
}

/*ARGSUSED*/
static void
dump_zap(objset_t *os, uint64_t object, void *data, size_t size)
{
	zap_cursor_t zc;
	zap_attribute_t attr;
	void *prop;
	unsigned i;

	dump_zap_stats(os, object);
	(void) printf("\n");

	for (zap_cursor_init(&zc, os, object);
	    zap_cursor_retrieve(&zc, &attr) == 0;
	    zap_cursor_advance(&zc)) {
		(void) printf("\t\t%s = ", attr.za_name);
		if (attr.za_num_integers == 0) {
			(void) printf("\n");
			continue;
		}
		prop = umem_zalloc(attr.za_num_integers *
		    attr.za_integer_length, UMEM_NOFAIL);
		(void) zap_lookup(os, object, attr.za_name,
		    attr.za_integer_length, attr.za_num_integers, prop);
		if (attr.za_integer_length == 1) {
			/* single-byte integers are printed as a string */
			(void) printf("%s", (char *)prop);
		} else {
			for (i = 0; i < attr.za_num_integers; i++) {
				switch (attr.za_integer_length) {
				case 2:
					(void) printf("%u ",
					    ((uint16_t *)prop)[i]);
					break;
				case 4:
					(void) printf("%u ",
					    ((uint32_t *)prop)[i]);
					break;
				case 8:
					/* cast matches the signed %lld */
					(void) printf("%lld ",
					    (long long)((int64_t *)prop)[i]);
					break;
				}
			}
		}
		(void) printf("\n");
		umem_free(prop, attr.za_num_integers * attr.za_integer_length);
	}
	zap_cursor_fini(&zc);
}

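The width dispatch in dump_zap() (2-, 4- and 8-byte integers printed one by one) can be sketched outside of zdb as follows. This is an illustration under stated assumptions: `format_ints()` is a hypothetical helper that writes into a caller-supplied buffer instead of stdout, it uses memcpy() rather than pointer casts, and it does not cover the 1-byte case, which dump_zap() prints as a string.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of zdb): format `count` integers of
 * `width` bytes each from `buf` into `out`, each followed by a space.
 * Returns 0 on success, -1 on an unsupported width or a full buffer.
 */
static int
format_ints(const void *buf, size_t width, size_t count,
    char *out, size_t outsz)
{
	const unsigned char *p = buf;
	size_t used = 0;

	assert(outsz > 0);
	out[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n;

		/* memcpy avoids unaligned access through a cast pointer. */
		switch (width) {
		case 2: {
			uint16_t v;
			memcpy(&v, p + i * width, sizeof (v));
			n = snprintf(out + used, outsz - used, "%u ",
			    (unsigned)v);
			break;
		}
		case 4: {
			uint32_t v;
			memcpy(&v, p + i * width, sizeof (v));
			n = snprintf(out + used, outsz - used,
			    "%" PRIu32 " ", v);
			break;
		}
		case 8: {
			/* 8-byte values are treated as signed, as above. */
			int64_t v;
			memcpy(&v, p + i * width, sizeof (v));
			n = snprintf(out + used, outsz - used,
			    "%" PRId64 " ", v);
			break;
		}
		default:
			return (-1);
		}
		if (n < 0 || (size_t)n >= outsz - used)
			return (-1);
		used += (size_t)n;
	}
	return (0);
}
```

Returning an error for an unknown width is a design choice of this sketch; the switch in dump_zap() silently prints nothing in that case.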
static void
dump_bpobj(objset_t *os, uint64_t object, void *data, size_t size)
{
	bpobj_phys_t *bpop = data;
	uint64_t i;
	char bytes[32], comp[32], uncomp[32];

	/* make sure the output won't get truncated */
	CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
	CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
	CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);

	if (bpop == NULL)
		return;

	zdb_nicenum(bpop->bpo_bytes, bytes, sizeof (bytes));
	zdb_nicenum(bpop->bpo_comp, comp, sizeof (comp));
	zdb_nicenum(bpop->bpo_uncomp, uncomp, sizeof (uncomp));

	(void) printf("\t\tnum_blkptrs = %llu\n",
	    (u_longlong_t)bpop->bpo_num_blkptrs);
	(void) printf("\t\tbytes = %s\n", bytes);
	if (size >= BPOBJ_SIZE_V1) {
		(void) printf("\t\tcomp = %s\n", comp);
		(void) printf("\t\tuncomp = %s\n", uncomp);
	}
	if (size >= sizeof (*bpop)) {
		(void) printf("\t\tsubobjs = %llu\n",
		    (u_longlong_t)bpop->bpo_subobjs);
		(void) printf("\t\tnum_subobjs = %llu\n",
		    (u_longlong_t)bpop->bpo_num_subobjs);
	}

	/* individual block pointers are only listed at -ddddd and above */
	if (dump_opt['d'] < 5)
		return;

	for (i = 0; i < bpop->bpo_num_blkptrs; i++) {
		char blkbuf[BP_SPRINTF_LEN];
		blkptr_t bp;

		int err = dmu_read(os, object,
		    i * sizeof (bp), sizeof (bp), &bp, 0);
		if (err != 0) {
			(void) printf("got error %d from dmu_read\n", err);
			break;
		}
		snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), &bp);
		(void) printf("\t%s\n", blkbuf);
	}
}

/* ARGSUSED */
static void
dump_bpobj_subobjs(objset_t *os, uint64_t object, void *data, size_t size)
{
	dmu_object_info_t doi;
	int64_t i;

	VERIFY0(dmu_object_info(os, object, &doi));
	uint64_t *subobjs = kmem_alloc(doi.doi_max_offset, KM_SLEEP);

	int err = dmu_read(os, object, 0, doi.doi_max_offset, subobjs, 0);
	if (err != 0) {
		(void) printf("got error %d from dmu_read\n", err);
		kmem_free(subobjs, doi.doi_max_offset);
		return;
	}

	/* find the last used slot so trailing zeros are not printed */
	int64_t last_nonzero = -1;
	for (i = 0; i < doi.doi_max_offset / 8; i++) {
		if (subobjs[i] != 0)
			last_nonzero = i;
	}

	for (i = 0; i <= last_nonzero; i++) {
		(void) printf("\t%llu\n", (u_longlong_t)subobjs[i]);
	}
	kmem_free(subobjs, doi.doi_max_offset);
}

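The trailing-zero trimming in dump_bpobj_subobjs() is a small scan that is easy to check on its own. The sketch below uses a hypothetical `last_nonzero_index()` helper on a plain array; it is not a zdb function and makes no DMU calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the index of the last nonzero element of
 * vals[0..count-1], or -1 if every element is zero (or count is 0).
 */
static int64_t
last_nonzero_index(const uint64_t *vals, size_t count)
{
	int64_t last = -1;

	assert(vals != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (vals[i] != 0)
			last = (int64_t)i;
	}
	return (last);
}
```

Starting from -1 means a loop of the form `for (i = 0; i <= last; i++)` prints nothing for an all-zero array, and zeros that sit between nonzero entries are still printed.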
/*ARGSUSED*/
static void
dump_ddt_zap(objset_t *os, uint64_t object, void *data, size_t size)
{
	dump_zap_stats(os, object);
	/* contents are printed elsewhere, properly decoded */
}

/*ARGSUSED*/
static void
dump_sa_attrs(objset_t *os, uint64_t object, void *data, size_t size)
{
	zap_cursor_t zc;
	zap_attribute_t attr;

	dump_zap_stats(os, object);
	(void) printf("\n");

	for (zap_cursor_init(&zc, os, object);
	    zap_cursor_retrieve(&zc, &attr) == 0;
	    zap_cursor_advance(&zc)) {
		(void) printf("\t\t%s = ", attr.za_name);
		if (attr.za_num_integers == 0) {
			(void) printf("\n");
			continue;
		}
		(void) printf(" %llx : [%d:%d:%d]\n",
		    (u_longlong_t)attr.za_first_integer,
		    (int)ATTR_LENGTH(attr.za_first_integer),
		    (int)ATTR_BSWAP(attr.za_first_integer),
		    (int)ATTR_NUM(attr.za_first_integer));
	}
	zap_cursor_fini(&zc);
}

/*ARGSUSED*/
static void
dump_sa_layouts(objset_t *os, uint64_t object, void *data, size_t size)
{
	zap_cursor_t zc;
	zap_attribute_t attr;
	uint16_t *layout_attrs;
	unsigned i;

	dump_zap_stats(os, object);
	(void) printf("\n");

	for (zap_cursor_init(&zc, os, object);
	    zap_cursor_retrieve(&zc, &attr) == 0;
	    zap_cursor_advance(&zc)) {
		(void) printf("\t\t%s = [", attr.za_name);
		if (attr.za_num_integers == 0) {
			(void) printf("\n");
			continue;
		}

		VERIFY(attr.za_integer_length == 2);
		layout_attrs = umem_zalloc(attr.za_num_integers *
		    attr.za_integer_length, UMEM_NOFAIL);

		VERIFY(zap_lookup(os, object, attr.za_name,
		    attr.za_integer_length,
		    attr.za_num_integers, layout_attrs) == 0);

		for (i = 0; i != attr.za_num_integers; i++)
			(void) printf(" %d ", (int)layout_attrs[i]);
		(void) printf("]\n");
		umem_free(layout_attrs,
		    attr.za_num_integers * attr.za_integer_length);
	}
	zap_cursor_fini(&zc);
}

/*ARGSUSED*/
static void
dump_zpldir(objset_t *os, uint64_t object, void *data, size_t size)
{
	zap_cursor_t zc;
	zap_attribute_t attr;
	const char *typenames[] = {
		/* 0 */ "not specified",
		/* 1 */ "FIFO",
		/* 2 */ "Character Device",
		/* 3 */ "3 (invalid)",
		/* 4 */ "Directory",
		/* 5 */ "5 (invalid)",
		/* 6 */ "Block Device",
		/* 7 */ "7 (invalid)",
		/* 8 */ "Regular File",
		/* 9 */ "9 (invalid)",
		/* 10 */ "Symbolic Link",
		/* 11 */ "11 (invalid)",
		/* 12 */ "Socket",
		/* 13 */ "Door",
		/* 14 */ "Event Port",
		/* 15 */ "15 (invalid)",
	};

	dump_zap_stats(os, object);
	(void) printf("\n");

	for (zap_cursor_init(&zc, os, object);
	    zap_cursor_retrieve(&zc, &attr) == 0;
	    zap_cursor_advance(&zc)) {
		/* the object number is unsigned; cast it to match %llu */
		(void) printf("\t\t%s = %llu (type: %s)\n",
		    attr.za_name,
		    (u_longlong_t)ZFS_DIRENT_OBJ(attr.za_first_integer),
		    typenames[ZFS_DIRENT_TYPE(attr.za_first_integer)]);
	}
	zap_cursor_fini(&zc);
}

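dump_zpldir() relies on each directory entry packing an object number and a file type into one 64-bit value. The sketch below shows that kind of unpacking with hypothetical helpers (`bits64()`, `dirent_obj()`, `dirent_type()`). The bit positions used here, object number in the low 48 bits and type in the top 4 bits, are an assumption about how ZFS_DIRENT_OBJ and ZFS_DIRENT_TYPE are defined and should be checked against the ZFS headers.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `len` bits starting at bit `low` from a 64-bit word. */
static uint64_t
bits64(uint64_t x, unsigned low, unsigned len)
{
	uint64_t mask;

	assert(len >= 1 && len <= 64 && low <= 64 - len);
	/* Shifting by 64 is undefined, so handle the full-width mask. */
	mask = (len == 64) ? UINT64_MAX : ((UINT64_C(1) << len) - 1);
	return ((x >> low) & mask);
}

/* Assumed layout: object number in bits 0..47. */
static uint64_t
dirent_obj(uint64_t de)
{
	return (bits64(de, 0, 48));
}

/* Assumed layout: file type in bits 60..63, so always in 0..15. */
static unsigned
dirent_type(uint64_t de)
{
	return ((unsigned)bits64(de, 60, 4));
}
```

Because the type field is four bits wide under this assumption, its value can index a 16-entry name table such as `typenames[]` without a separate bounds check.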
static int
get_dtl_refcount(vdev_t *vd)
{
	int refcount = 0;

	if (vd->vdev_ops->vdev_op_leaf) {
		space_map_t *sm = vd->vdev_dtl_sm;

		if (sm != NULL &&
		    sm->sm_dbuf->db_size == sizeof (space_map_phys_t))
			return (1);
		return (0);
	}

	for (unsigned c = 0; c < vd->vdev_children; c++)
		refcount += get_dtl_refcount(vd->vdev_child[c]);
	return (refcount);
}

static int
get_metaslab_refcount(vdev_t *vd)
{
	int refcount = 0;

	if (vd->vdev_top == vd) {
		for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
			space_map_t *sm = vd->vdev_ms[m]->ms_sm;

			if (sm != NULL &&
			    sm->sm_dbuf->db_size == sizeof (space_map_phys_t))
				refcount++;
		}
	}
	for (unsigned c = 0; c < vd->vdev_children; c++)
		refcount += get_metaslab_refcount(vd->vdev_child[c]);

	return (refcount);
}

static int
get_obsolete_refcount(vdev_t *vd)
{
	int refcount = 0;

	uint64_t obsolete_sm_obj = vdev_obsolete_sm_object(vd);
	if (vd->vdev_top == vd && obsolete_sm_obj != 0) {
		dmu_object_info_t doi;
		VERIFY0(dmu_object_info(vd->vdev_spa->spa_meta_objset,
		    obsolete_sm_obj, &doi));
		if (doi.doi_bonus_size == sizeof (space_map_phys_t)) {
			refcount++;
		}
	} else {
		ASSERT3P(vd->vdev_obsolete_sm, ==, NULL);
		ASSERT3U(obsolete_sm_obj, ==, 0);
	}
	for (unsigned c = 0; c < vd->vdev_children; c++) {
		refcount += get_obsolete_refcount(vd->vdev_child[c]);
	}

	return (refcount);
}

static int
get_prev_obsolete_spacemap_refcount(spa_t *spa)
{
	uint64_t prev_obj =
	    spa->spa_condensing_indirect_phys.scip_prev_obsolete_sm_object;
	if (prev_obj != 0) {
		dmu_object_info_t doi;
		VERIFY0(dmu_object_info(spa->spa_meta_objset, prev_obj, &doi));
		if (doi.doi_bonus_size == sizeof (space_map_phys_t)) {
			return (1);
		}
	}
	return (0);
}

static int
verify_spacemap_refcounts(spa_t *spa)
{
	uint64_t expected_refcount = 0;
	uint64_t actual_refcount;

	(void) feature_get_refcount(spa,
	    &spa_feature_table[SPA_FEATURE_SPACEMAP_HISTOGRAM],
	    &expected_refcount);
	actual_refcount = get_dtl_refcount(spa->spa_root_vdev);
	actual_refcount += get_metaslab_refcount(spa->spa_root_vdev);
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility: the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL, at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
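The remapping of reads and frees described in the commit message above can be sketched as a range lookup. This is only an illustration under stated assumptions: the remap_entry_t layout, its field names, and remap_lookup() are hypothetical and are not the identifiers used by the device removal code. It assumes the table is sorted by source offset and that ranges do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mapping entry: a source range on the removed (indirect)
 * vdev and the location its data was copied to.
 */
typedef struct remap_entry {
	uint64_t src_offset;	/* start of the range on the removed vdev */
	uint64_t size;		/* length of the range in bytes */
	uint64_t dst_vdev;	/* id of the vdev now holding the data */
	uint64_t dst_offset;	/* start of the range on dst_vdev */
} remap_entry_t;

/*
 * Binary search over a table sorted by src_offset.
 * Returns 0 and fills in the new location if the offset is mapped,
 * or -1 if no entry covers it.
 */
static int
remap_lookup(const remap_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);
	assert(dst_vdev != NULL && dst_offset != NULL);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const remap_entry_t *e = &tbl[mid];

		if (offset < e->src_offset) {
			hi = mid;		/* look in the lower half */
		} else if (offset - e->src_offset >= e->size) {
			lo = mid + 1;		/* look in the upper half */
		} else {
			/* Offset falls inside this entry's source range. */
			*dst_vdev = e->dst_vdev;
			*dst_offset = e->dst_offset +
			    (offset - e->src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Keeping the table sorted makes each lookup logarithmic in the number of entries, which fits the message's point that an in-memory table adds little overhead.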
2016-09-22 19:30:13 +03:00
|
|
|
actual_refcount += get_obsolete_refcount(spa->spa_root_vdev);
|
|
|
|
actual_refcount += get_prev_obsolete_spacemap_refcount(spa);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
take into account the make-up of that free space, which meant we could
keep preferring and loading a highly fragmented space map that wouldn't
actually have enough contiguous space to satisfy the allocation, then
unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to 4K, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed that could leak an object when removing a mirrored log
device. The previous logic for vdev_remove() did not correctly handle
removing top-level vdevs that are interior vdevs (i.e. mirror). The
problem would occur when removing a mirrored log device and result in
the DTL space map object being leaked, because top-level vdevs don't
have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
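The idea of storing information about contiguous free space, as described in the commit message above, can be sketched with a power-of-two histogram of free segment sizes. This is an assumption-laden illustration: HIST_BUCKETS, highbit(), hist_add() and hist_can_satisfy() are hypothetical names, and the bucket layout is chosen for the example rather than taken from the on-disk format.

```c
#include <assert.h>
#include <stdint.h>

#define	HIST_BUCKETS	32	/* assumed bucket count for this sketch */

/* Index of the highest set bit, i.e. floor(log2(x)); x must be non-zero. */
static int
highbit(uint64_t x)
{
	int b = 0;

	assert(x != 0);
	while (x >>= 1)
		b++;
	return (b);
}

/*
 * Record one free segment of `size` bytes. Bucket i counts segments
 * whose size lies in [2^(i + shift), 2^(i + shift + 1)); anything
 * larger is clamped into the last bucket.
 */
static void
hist_add(uint64_t *hist, int shift, uint64_t size)
{
	int idx = highbit(size) - shift;

	assert(idx >= 0);
	if (idx >= HIST_BUCKETS)
		idx = HIST_BUCKETS - 1;
	hist[idx]++;
}

/*
 * Return 1 if the histogram guarantees a free segment of at least
 * `size` bytes, 0 otherwise. Only buckets whose lower bound is >= size
 * are trusted, so the answer is conservative.
 */
static int
hist_can_satisfy(const uint64_t *hist, int shift, uint64_t size)
{
	/* ceil(log2(size)) - shift: first bucket with lower bound >= size */
	int need = highbit(size) + ((size & (size - 1)) != 0) - shift;

	if (need < 0)
		need = 0;
	for (int i = need; i < HIST_BUCKETS; i++) {
		if (hist[i] != 0)
			return (1);
	}
	return (0);
}
```

A selector could consult such a histogram before loading a space map and skip one that has plenty of free space but no segment large enough for the request.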
2013-10-02 01:25:53 +04:00
|
|
|
|
|
|
|
if (expected_refcount != actual_refcount) {
|
2013-10-08 21:13:05 +04:00
|
|
|
(void) printf("space map refcount mismatch: expected %lld != "
|
|
|
|
"actual %lld\n",
|
|
|
|
(longlong_t)expected_refcount,
|
|
|
|
(longlong_t)actual_refcount);
|
2013-10-02 01:25:53 +04:00
|
|
|
return (2);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2013-10-02 01:25:53 +04:00
|
|
|
dump_spacemap(objset_t *os, space_map_t *sm)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint64_t alloc, offset, entry;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *ddata[] = { "ALLOC", "FREE", "CONDENSE", "INVALID",
|
2016-09-22 19:30:13 +03:00
|
|
|
"INVALID", "INVALID", "INVALID", "INVALID" };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-10-02 01:25:53 +04:00
|
|
|
if (sm == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2016-09-22 19:30:13 +03:00
|
|
|
(void) printf("space map object %llu:\n",
|
|
|
|
(u_longlong_t)sm->sm_phys->smp_object);
|
|
|
|
(void) printf(" smp_objsize = 0x%llx\n",
|
|
|
|
(u_longlong_t)sm->sm_phys->smp_objsize);
|
|
|
|
(void) printf(" smp_alloc = 0x%llx\n",
|
|
|
|
(u_longlong_t)sm->sm_phys->smp_alloc);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Print out the freelist entries in both encoded and decoded form.
|
|
|
|
*/
|
|
|
|
alloc = 0;
|
2013-10-02 01:25:53 +04:00
|
|
|
for (offset = 0; offset < space_map_length(sm);
|
|
|
|
offset += sizeof (entry)) {
|
|
|
|
uint8_t mapshift = sm->sm_shift;
|
|
|
|
|
|
|
|
VERIFY0(dmu_read(os, space_map_object(sm), offset,
|
2009-07-03 02:44:48 +04:00
|
|
|
sizeof (entry), &entry, DMU_READ_PREFETCH));
|
2008-11-20 23:01:55 +03:00
|
|
|
if (SM_DEBUG_DECODE(entry)) {
|
2013-10-02 01:25:53 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t [%6llu] %s: txg %llu, pass %llu\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)(offset / sizeof (entry)),
|
|
|
|
ddata[SM_DEBUG_ACTION_DECODE(entry)],
|
|
|
|
(u_longlong_t)SM_DEBUG_TXG_DECODE(entry),
|
|
|
|
(u_longlong_t)SM_DEBUG_SYNCPASS_DECODE(entry));
|
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t [%6llu] %c range:"
|
|
|
|
" %010llx-%010llx size: %06llx\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)(offset / sizeof (entry)),
|
|
|
|
SM_TYPE_DECODE(entry) == SM_ALLOC ? 'A' : 'F',
|
|
|
|
(u_longlong_t)((SM_OFFSET_DECODE(entry) <<
|
2013-10-02 01:25:53 +04:00
|
|
|
mapshift) + sm->sm_start),
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)((SM_OFFSET_DECODE(entry) <<
|
2013-10-02 01:25:53 +04:00
|
|
|
mapshift) + sm->sm_start +
|
|
|
|
(SM_RUN_DECODE(entry) << mapshift)),
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)(SM_RUN_DECODE(entry) << mapshift));
|
|
|
|
if (SM_TYPE_DECODE(entry) == SM_ALLOC)
|
|
|
|
alloc += SM_RUN_DECODE(entry) << mapshift;
|
|
|
|
else
|
|
|
|
alloc -= SM_RUN_DECODE(entry) << mapshift;
|
|
|
|
}
|
|
|
|
}
|
2013-10-02 01:25:53 +04:00
|
|
|
if (alloc != space_map_allocated(sm)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("space_map_object alloc (%llu) INCONSISTENT "
|
|
|
|
"with space map summary (%llu)\n",
|
2013-10-02 01:25:53 +04:00
|
|
|
(u_longlong_t)space_map_allocated(sm), (u_longlong_t)alloc);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
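The consistency check at the end of the function above can be illustrated in isolation. The following is a minimal sketch under stated assumptions: `sketch_entry_t`, `sketch_replay()` and `sketch_consistent()` are hypothetical names, and the entry layout is a plain struct rather than the packed 64-bit word that the `SM_*_DECODE` macros operate on. Debug entries, which the real loop prints without counting, are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry type; the real code decodes a packed 64-bit word. */
typedef enum { SKETCH_ALLOC, SKETCH_FREE } sketch_type_t;

typedef struct sketch_entry {
	sketch_type_t type;	/* allocation or free record */
	uint64_t run;		/* run length in units of (1 << shift) bytes */
} sketch_entry_t;

/*
 * Replay the log: add each ALLOC run and subtract each FREE run,
 * returning the net number of allocated bytes.
 */
static int64_t
sketch_replay(const sketch_entry_t *entries, size_t n, unsigned shift)
{
	int64_t alloc = 0;

	assert(shift < 63);
	for (size_t i = 0; i < n; i++) {
		int64_t bytes = (int64_t)(entries[i].run << shift);

		if (entries[i].type == SKETCH_ALLOC)
			alloc += bytes;
		else
			alloc -= bytes;
	}
	return (alloc);
}

/* Return 1 if the replayed total matches the recorded summary value. */
static int
sketch_consistent(const sketch_entry_t *entries, size_t n, unsigned shift,
    int64_t summary)
{
	return (sketch_replay(entries, n, shift) == summary);
}
```

A mismatch between the replayed total and the stored summary corresponds to the "INCONSISTENT" message printed above.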
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
static void
|
|
|
|
dump_metaslab_stats(metaslab_t *msp)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char maxbuf[32];
|
2013-10-02 01:25:53 +04:00
|
|
|
range_tree_t *rt = msp->ms_tree;
|
|
|
|
avl_tree_t *t = &msp->ms_size_tree;
|
|
|
|
int free_pct = range_tree_space(rt) * 100 / msp->ms_size;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (maxbuf) >= NN_NUMBUF_SZ);
|
|
|
|
|
|
|
|
zdb_nicenum(metaslab_block_maxsize(msp), maxbuf, sizeof (maxbuf));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t %25s %10lu %7s %6s %4s %4d%%\n",
|
2009-07-03 02:44:48 +04:00
|
|
|
"segments", avl_numnodes(t), "maxsize", maxbuf,
|
|
|
|
"freepct", free_pct);
|
2013-10-02 01:25:53 +04:00
|
|
|
(void) printf("\tIn-memory histogram:\n");
|
|
|
|
dump_histogram(rt->rt_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_metaslab(metaslab_t *msp)
|
|
|
|
{
|
|
|
|
vdev_t *vd = msp->ms_group->mg_vd;
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
2013-10-02 01:25:53 +04:00
|
|
|
space_map_t *sm = msp->ms_sm;
|
2010-05-29 00:45:14 +04:00
|
|
|
char freebuf[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(msp->ms_size - space_map_allocated(sm), freebuf,
|
|
|
|
sizeof (freebuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf(
|
2010-05-29 00:45:14 +04:00
|
|
|
"\tmetaslab %6llu offset %12llx spacemap %6llu free %5s\n",
|
2013-10-02 01:25:53 +04:00
|
|
|
(u_longlong_t)msp->ms_id, (u_longlong_t)msp->ms_start,
|
|
|
|
(u_longlong_t)space_map_object(sm), freebuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-10-02 01:25:53 +04:00
|
|
|
if (dump_opt['m'] > 2 && !dump_opt['L']) {
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_enter(&msp->ms_lock);
|
2013-10-02 01:25:53 +04:00
|
|
|
metaslab_load_wait(msp);
|
|
|
|
if (!msp->ms_loaded) {
|
|
|
|
VERIFY0(metaslab_load(msp));
|
|
|
|
range_tree_stat_verify(msp->ms_tree);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
dump_metaslab_stats(msp);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
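The selection problem described above can be sketched as follows. This is only an illustration, not the code from this change: the sm_summary_t type, HIST_BUCKETS, sm_can_satisfy() and sm_pick() are hypothetical names, and the bucket layout (bucket i counts free segments of at least 2^(i + shift) bytes) is assumed from the zdb comment that bucket 0 refers to 2^sm_shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	HIST_BUCKETS	32	/* hypothetical stand-in for the real size */

/* Hypothetical per-space-map summary: free bytes plus a histogram. */
typedef struct sm_summary {
	uint64_t free_bytes;
	uint8_t shift;			/* bucket 0 covers 2^shift bytes */
	uint64_t hist[HIST_BUCKETS];	/* hist[i]: segments >= 2^(i+shift) */
} sm_summary_t;

/* Return 1 if some non-empty bucket guarantees a segment >= size. */
static int
sm_can_satisfy(const sm_summary_t *sm, uint64_t size)
{
	for (int i = 0; i < HIST_BUCKETS && i + sm->shift < 64; i++) {
		uint64_t lo = 1ULL << (i + sm->shift);
		if (sm->hist[i] != 0 && lo >= size)
			return (1);
	}
	return (0);
}

/*
 * Among the candidates that can fit the allocation, prefer the one
 * with the most free space. Returns an index, or -1 if none fits.
 */
static int
sm_pick(const sm_summary_t *sms, size_t n, uint64_t size)
{
	int best = -1;

	assert(sms != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!sm_can_satisfy(&sms[i], size))
			continue;
		if (best < 0 || sms[i].free_bytes > sms[best].free_bytes)
			best = (int)i;
	}
	return (best);
}
```

With free space alone, a map holding 4M in 4K fragments would win over a map holding a single 1M segment; with the histogram, a 128K request goes to the latter.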
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to 4K, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked, because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
metaslab_unload(msp);
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_exit(&msp->ms_lock);
|
|
|
|
}
|
|
|
|
|
2013-10-02 01:25:53 +04:00
|
|
|
if (dump_opt['m'] > 1 && sm != NULL &&
|
2013-10-08 21:13:05 +04:00
|
|
|
spa_feature_is_active(spa, SPA_FEATURE_SPACEMAP_HISTOGRAM)) {
|
2013-10-02 01:25:53 +04:00
|
|
|
/*
|
|
|
|
* The space map histogram represents free space in chunks
|
|
|
|
* of sm_shift (i.e. bucket 0 refers to 2^sm_shift).
|
|
|
|
*/
|
2014-07-20 00:19:24 +04:00
|
|
|
(void) printf("\tOn-disk histogram:\t\tfragmentation %llu\n",
|
|
|
|
(u_longlong_t)msp->ms_fragmentation);
|
2013-10-02 01:25:53 +04:00
|
|
|
dump_histogram(sm->sm_phys->smp_histogram,
|
2014-07-20 00:19:24 +04:00
|
|
|
SPACE_MAP_HISTOGRAM_SIZE, sm->sm_shift);
|
2013-10-02 01:25:53 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (dump_opt['d'] > 5 || dump_opt['m'] > 3) {
|
|
|
|
ASSERT(msp->ms_size == (1ULL << vd->vdev_ms_shift));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2013-10-02 01:25:53 +04:00
|
|
|
dump_spacemap(spa->spa_meta_objset, msp->ms_sm);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
print_vdev_metaslab_header(vdev_t *vd)
|
|
|
|
{
|
|
|
|
(void) printf("\tvdev %10llu\n\t%-10s%5llu %-19s %-15s %-10s\n",
|
|
|
|
(u_longlong_t)vd->vdev_id,
|
|
|
|
"metaslabs", (u_longlong_t)vd->vdev_ms_count,
|
|
|
|
"offset", "spacemap", "free");
|
|
|
|
(void) printf("\t%15s %19s %15s %10s\n",
|
|
|
|
"---------------", "-------------------",
|
|
|
|
"---------------", "-------------");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-07-20 00:19:24 +04:00
|
|
|
static void
|
|
|
|
dump_metaslab_groups(spa_t *spa)
|
|
|
|
{
|
|
|
|
vdev_t *rvd = spa->spa_root_vdev;
|
|
|
|
metaslab_class_t *mc = spa_normal_class(spa);
|
|
|
|
uint64_t fragmentation;
|
|
|
|
|
|
|
|
metaslab_class_histogram_verify(mc);
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < rvd->vdev_children; c++) {
|
2014-07-20 00:19:24 +04:00
|
|
|
vdev_t *tvd = rvd->vdev_child[c];
|
|
|
|
metaslab_group_t *mg = tvd->vdev_mg;
|
|
|
|
|
|
|
|
if (mg->mg_class != mc)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
metaslab_group_histogram_verify(mg);
|
|
|
|
mg->mg_fragmentation = metaslab_group_fragmentation(mg);
|
|
|
|
|
|
|
|
(void) printf("\tvdev %10llu\t\tmetaslabs%5llu\t\t"
|
|
|
|
"fragmentation",
|
|
|
|
(u_longlong_t)tvd->vdev_id,
|
|
|
|
(u_longlong_t)tvd->vdev_ms_count);
|
|
|
|
if (mg->mg_fragmentation == ZFS_FRAG_INVALID) {
|
|
|
|
(void) printf("%3s\n", "-");
|
|
|
|
} else {
|
|
|
|
(void) printf("%3llu%%\n",
|
|
|
|
(u_longlong_t)mg->mg_fragmentation);
|
|
|
|
}
|
|
|
|
dump_histogram(mg->mg_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\tpool %s\tfragmentation", spa_name(spa));
|
|
|
|
fragmentation = metaslab_class_fragmentation(mc);
|
|
|
|
if (fragmentation == ZFS_FRAG_INVALID)
|
|
|
|
(void) printf("\t%3s\n", "-");
|
|
|
|
else
|
|
|
|
(void) printf("\t%3llu%%\n", (u_longlong_t)fragmentation);
|
|
|
|
dump_histogram(mc->mc_histogram, RANGE_TREE_HISTOGRAM_SIZE, 0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
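The remapping step described above can be sketched as a lookup in a sorted table. This is a minimal illustration under stated assumptions, not the ZFS implementation: map_entry_t, map_lookup() and map_remap() are hypothetical names, and the table is assumed to be sorted by source offset with non-overlapping entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied run of the removed vdev. */
typedef struct map_entry {
	uint64_t src_offset;	/* offset on the removed (indirect) vdev */
	uint64_t size;		/* length of the run in bytes */
	uint64_t dst_vdev;	/* vdev the run was copied to */
	uint64_t dst_offset;	/* offset of the copy on dst_vdev */
} map_entry_t;

/*
 * Binary-search a table sorted by src_offset for the entry that
 * covers offset. Returns NULL if the offset is not mapped.
 */
static const map_entry_t *
map_lookup(const map_entry_t *tbl, size_t n, uint64_t offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const map_entry_t *e = &tbl[mid];

		if (offset < e->src_offset)
			hi = mid;
		else if (offset - e->src_offset >= e->size)
			lo = mid + 1;
		else
			return (e);
	}
	return (NULL);
}

/* Translate an old location to its new one; -1 if unmapped. */
static int
map_remap(const map_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	assert(dst_vdev != NULL && dst_offset != NULL);

	const map_entry_t *e = map_lookup(tbl, n, offset);
	if (e == NULL)
		return (-1);
	*dst_vdev = e->dst_vdev;
	*dst_offset = e->dst_offset + (offset - e->src_offset);
	return (0);
}
```

Because the table lives in memory, each read or free against the indirect vdev costs one O(log n) search, which is consistent with the minimal overhead claimed above.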
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
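How obsolete entries shrink the table can be sketched as below. This is an illustration only: sketch_entry_t and condense_entries() are hypothetical, and the sketch assumes that each entry carries a count of obsolete bytes and that an entry can be dropped once that count equals its size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry, reduced to what the sketch needs. */
typedef struct sketch_entry {
	uint64_t src_offset;
	uint64_t size;
} sketch_entry_t;

/*
 * Copy every entry that is still referenced into out and return how
 * many were kept. obsolete[i] is the number of bytes of entry i that
 * no block pointer uses any more (an assumption of this sketch).
 */
static size_t
condense_entries(const sketch_entry_t *in, const uint32_t *obsolete,
    size_t n, sketch_entry_t *out)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		assert(obsolete[i] <= in[i].size);
		if (obsolete[i] == in[i].size)
			continue;	/* fully obsolete: drop it */
		out[kept++] = in[i];
	}
	return (kept);
}
```

A partially obsolete entry is kept whole in this sketch, since some block pointers may still refer to it.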
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
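The first porting note above (avoiding a zero-sized kmem_alloc() in vdev_compact_children()) can be sketched in userland C. This is not the kernel code: sketch_vdev_t and compact_children() are hypothetical, and calloc()/free() stand in for the kernel allocator. The point is only that the allocation is skipped when no children remain, so the array pointer ends up NULL rather than a sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the child array of a vdev. */
typedef struct sketch_vdev {
	void **child;
	size_t children;
} sketch_vdev_t;

/* Drop NULL slots; leave child == NULL when nothing remains. */
static void
compact_children(sketch_vdev_t *vd)
{
	size_t newc = 0;

	for (size_t c = 0; c < vd->children; c++) {
		if (vd->child[c] != NULL)
			newc++;
	}

	void **newchild = NULL;
	if (newc > 0) {		/* never request a zero-sized allocation */
		size_t i = 0;

		newchild = calloc(newc, sizeof (void *));
		assert(newchild != NULL);
		for (size_t c = 0; c < vd->children; c++) {
			if (vd->child[c] != NULL)
				newchild[i++] = vd->child[c];
		}
	}

	free(vd->child);
	vd->child = newchild;
	vd->children = newc;
}
```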
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
static void
|
|
|
|
print_vdev_indirect(vdev_t *vd)
|
|
|
|
{
|
|
|
|
vdev_indirect_config_t *vic = &vd->vdev_indirect_config;
|
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
vdev_indirect_births_t *vib = vd->vdev_indirect_births;
|
|
|
|
|
|
|
|
if (vim == NULL) {
|
|
|
|
ASSERT3P(vib, ==, NULL);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3U(vdev_indirect_mapping_object(vim), ==,
|
|
|
|
vic->vic_mapping_object);
|
|
|
|
ASSERT3U(vdev_indirect_births_object(vib), ==,
|
|
|
|
vic->vic_births_object);
|
|
|
|
|
|
|
|
(void) printf("indirect births obj %llu:\n",
|
|
|
|
(longlong_t)vic->vic_births_object);
|
|
|
|
(void) printf(" vib_count = %llu\n",
|
|
|
|
(longlong_t)vdev_indirect_births_count(vib));
|
|
|
|
for (uint64_t i = 0; i < vdev_indirect_births_count(vib); i++) {
|
|
|
|
vdev_indirect_birth_entry_phys_t *cur_vibe =
|
|
|
|
&vib->vib_entries[i];
|
|
|
|
(void) printf("\toffset %llx -> txg %llu\n",
|
|
|
|
(longlong_t)cur_vibe->vibe_offset,
|
|
|
|
(longlong_t)cur_vibe->vibe_phys_birth_txg);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
(void) printf("indirect mapping obj %llu:\n",
|
|
|
|
(longlong_t)vic->vic_mapping_object);
|
|
|
|
(void) printf(" vim_max_offset = 0x%llx\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_max_offset(vim));
|
|
|
|
(void) printf(" vim_bytes_mapped = 0x%llx\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_bytes_mapped(vim));
|
|
|
|
(void) printf(" vim_count = %llu\n",
|
|
|
|
(longlong_t)vdev_indirect_mapping_num_entries(vim));
|
|
|
|
|
|
|
|
if (dump_opt['d'] <= 5 && dump_opt['m'] <= 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
uint32_t *counts = vdev_indirect_mapping_load_obsolete_counts(vim);
|
|
|
|
|
|
|
|
for (uint64_t i = 0; i < vdev_indirect_mapping_num_entries(vim); i++) {
|
|
|
|
vdev_indirect_mapping_entry_phys_t *vimep =
|
|
|
|
&vim->vim_entries[i];
|
|
|
|
(void) printf("\t<%llx:%llx:%llx> -> "
|
|
|
|
"<%llx:%llx:%llx> (%x obsolete)\n",
|
|
|
|
(longlong_t)vd->vdev_id,
|
|
|
|
(longlong_t)DVA_MAPPING_GET_SRC_OFFSET(vimep),
|
|
|
|
(longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_VDEV(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_OFFSET(&vimep->vimep_dst),
|
|
|
|
(longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
|
|
|
|
counts[i]);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
uint64_t obsolete_sm_object = vdev_obsolete_sm_object(vd);
|
|
|
|
if (obsolete_sm_object != 0) {
|
|
|
|
objset_t *mos = vd->vdev_spa->spa_meta_objset;
|
|
|
|
(void) printf("obsolete space map object %llu:\n",
|
|
|
|
(u_longlong_t)obsolete_sm_object);
|
|
|
|
ASSERT(vd->vdev_obsolete_sm != NULL);
|
|
|
|
ASSERT3U(space_map_object(vd->vdev_obsolete_sm), ==,
|
|
|
|
obsolete_sm_object);
|
|
|
|
dump_spacemap(mos, vd->vdev_obsolete_sm);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_metaslabs(spa_t *spa)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *vd, *rvd = spa->spa_root_vdev;
|
|
|
|
uint64_t m, c = 0, children = rvd->vdev_children;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\nMetaslabs:\n");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['d'] && zopt_objects > 0) {
|
|
|
|
c = zopt_object[0];
|
|
|
|
|
|
|
|
if (c >= children)
|
|
|
|
(void) fatal("bad vdev id: %llu", (u_longlong_t)c);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zopt_objects > 1) {
|
|
|
|
vd = rvd->vdev_child[c];
|
|
|
|
print_vdev_metaslab_header(vd);
|
|
|
|
|
|
|
|
for (m = 1; m < zopt_objects; m++) {
|
|
|
|
if (zopt_object[m] < vd->vdev_ms_count)
|
|
|
|
dump_metaslab(
|
|
|
|
vd->vdev_ms[zopt_object[m]]);
|
|
|
|
else
|
|
|
|
(void) fprintf(stderr, "bad metaslab "
|
|
|
|
"number %llu\n",
|
|
|
|
(u_longlong_t)zopt_object[m]);
|
|
|
|
}
|
|
|
|
(void) printf("\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
children = c + 1;
|
|
|
|
}
|
|
|
|
for (; c < children; c++) {
|
|
|
|
vd = rvd->vdev_child[c];
|
|
|
|
print_vdev_metaslab_header(vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children. Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.
  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.
* ZTS changes:
    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux
    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.
    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible. Wait for up to 1 second for the removal thread
        to be started before giving up on it. Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.
* MMP compatibility: the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable(). Update
  mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled. When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.
  They may work better once the upstream pool import refactoring is
  merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
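The commit message above describes remapping a location on a removed (indirect) vdev to its new concrete location through a mapping table. The following is only a minimal sketch of that idea, assuming a table sorted by source offset. The `remap_entry_t` type and `remap_offset()` function are hypothetical names for illustration and are not the structures or functions this commit actually adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry; not the real ZFS on-disk or in-core layout. */
typedef struct remap_entry {
	uint64_t src_offset;	/* start of the range on the removed vdev */
	uint64_t size;		/* length of the range in bytes */
	uint64_t dst_vdev;	/* id of the vdev that now holds the data */
	uint64_t dst_offset;	/* start of the range on dst_vdev */
} remap_entry_t;

/*
 * Binary-search a table sorted by src_offset for the entry that covers
 * `offset`. On success, store the new location and return 0. Return -1
 * when no entry covers the offset (it was never allocated).
 */
static int
remap_offset(const remap_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const remap_entry_t *e = &tbl[mid];

		if (offset < e->src_offset) {
			hi = mid;
		} else if (offset - e->src_offset >= e->size) {
			lo = mid + 1;
		} else {
			*dst_vdev = e->dst_vdev;
			*dst_offset = e->dst_offset +
			    (offset - e->src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Because the table stays sorted and in memory, each lookup in this sketch costs O(log n), which is consistent with the commit's statement that operations on an indirect vdev carry little overhead.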
2016-09-22 19:30:13 +03:00
|
|
|
print_vdev_indirect(vd);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (m = 0; m < vd->vdev_ms_count; m++)
|
|
|
|
dump_metaslab(vd->vdev_ms[m]);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dump_dde(const ddt_t *ddt, const ddt_entry_t *dde, uint64_t index)
|
|
|
|
{
|
|
|
|
const ddt_phys_t *ddp = dde->dde_phys;
|
|
|
|
const ddt_key_t *ddk = &dde->dde_key;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *types[4] = { "ditto", "single", "double", "triple" };
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
blkptr_t blk;
|
2010-08-26 20:52:39 +04:00
|
|
|
int p;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ddp->ddp_phys_birth == 0)
|
|
|
|
continue;
|
|
|
|
ddt_bp_create(ddt->ddt_checksum, ddk, ddp, &blk);
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &blk);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("index %llx refcnt %llu %s %s\n",
|
|
|
|
(u_longlong_t)index, (u_longlong_t)ddp->ddp_refcnt,
|
|
|
|
types[p], blkbuf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_dedup_ratio(const ddt_stat_t *dds)
|
|
|
|
{
|
|
|
|
double rL, rP, rD, D, dedup, compress, copies;
|
|
|
|
|
|
|
|
if (dds->dds_blocks == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
rL = (double)dds->dds_ref_lsize;
|
|
|
|
rP = (double)dds->dds_ref_psize;
|
|
|
|
rD = (double)dds->dds_ref_dsize;
|
|
|
|
D = (double)dds->dds_dsize;
|
|
|
|
|
|
|
|
dedup = rD / D;
|
|
|
|
compress = rL / rP;
|
|
|
|
copies = rD / rP;
|
|
|
|
|
|
|
|
(void) printf("dedup = %.2f, compress = %.2f, copies = %.2f, "
|
|
|
|
"dedup * compress / copies = %.2f\n\n",
|
|
|
|
dedup, compress, copies, dedup * compress / copies);
|
|
|
|
}
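As a worked illustration of the arithmetic in dump_dedup_ratio(), the sketch below computes the same four ratios from plain numbers. The `dedup_ratios_t` type and `compute_dedup_ratios()` are hypothetical helpers, and the reading of the inputs (referenced logical, physical, and allocated sizes versus the allocated size actually on disk) is an assumption based on the field names.

```c
#include <assert.h>

/* Hypothetical result holder for the four ratios printed by zdb. */
typedef struct dedup_ratios {
	double dedup;		/* referenced dsize / allocated dsize */
	double compress;	/* referenced lsize / referenced psize */
	double copies;		/* referenced dsize / referenced psize */
	double combined;	/* dedup * compress / copies */
} dedup_ratios_t;

static dedup_ratios_t
compute_dedup_ratios(double ref_lsize, double ref_psize, double ref_dsize,
    double dsize)
{
	dedup_ratios_t r;

	/* The divisors must be nonzero, as for a non-empty DDT. */
	assert(ref_psize > 0 && dsize > 0);

	r.dedup = ref_dsize / dsize;
	r.compress = ref_lsize / ref_psize;
	r.copies = ref_dsize / ref_psize;
	r.combined = r.dedup * r.compress / r.copies;
	return (r);
}
```

For example, with 400 units referenced logically, 200 physically, 200 allocated by reference, and 100 actually allocated, the sketch yields dedup 2.00, compress 2.00, copies 1.00, and a combined ratio of 4.00.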
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_ddt(ddt_t *ddt, enum ddt_type type, enum ddt_class class)
|
|
|
|
{
|
|
|
|
char name[DDT_NAMELEN];
|
|
|
|
ddt_entry_t dde;
|
|
|
|
uint64_t walk = 0;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
uint64_t count, dspace, mspace;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = ddt_object_info(ddt, type, class, &doi);
|
|
|
|
|
|
|
|
if (error == ENOENT)
|
|
|
|
return;
|
|
|
|
ASSERT(error == 0);
|
|
|
|
|
2012-10-26 21:01:49 +04:00
|
|
|
error = ddt_object_count(ddt, type, class, &count);
|
|
|
|
ASSERT(error == 0);
|
|
|
|
if (count == 0)
|
2010-08-27 01:24:34 +04:00
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dspace = doi.doi_physical_blocks_512 << 9;
|
|
|
|
mspace = doi.doi_fill_count * doi.doi_data_block_size;
|
|
|
|
|
|
|
|
ddt_object_name(ddt, type, class, name);
|
|
|
|
|
|
|
|
(void) printf("%s: %llu entries, size %llu on disk, %llu in core\n",
|
|
|
|
name,
|
|
|
|
(u_longlong_t)count,
|
|
|
|
(u_longlong_t)(dspace / count),
|
|
|
|
(u_longlong_t)(mspace / count));
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
zpool_dump_ddt(NULL, &ddt->ddt_histogram[type][class]);
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 4)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (dump_opt['D'] < 5 && class == DDT_CLASS_UNIQUE)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("%s contents:\n\n", name);
|
|
|
|
|
|
|
|
while ((error = ddt_object_walk(ddt, type, class, &walk, &dde)) == 0)
|
|
|
|
dump_dde(ddt, &dde, walk);
|
|
|
|
|
|
|
|
ASSERT(error == ENOENT);
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_all_ddts(spa_t *spa)
|
|
|
|
{
|
2010-08-26 20:52:39 +04:00
|
|
|
ddt_histogram_t ddh_total;
|
|
|
|
ddt_stat_t dds_total;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&ddh_total, sizeof (ddh_total));
|
|
|
|
bzero(&dds_total, sizeof (dds_total));
|
2010-08-26 20:52:39 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (enum zio_checksum c = 0; c < ZIO_CHECKSUM_FUNCTIONS; c++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_t *ddt = spa->spa_ddt[c];
|
2017-10-27 22:46:35 +03:00
|
|
|
for (enum ddt_type type = 0; type < DDT_TYPES; type++) {
|
|
|
|
for (enum ddt_class class = 0; class < DDT_CLASSES;
|
2010-05-29 00:45:14 +04:00
|
|
|
class++) {
|
|
|
|
dump_ddt(ddt, type, class);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
ddt_get_dedup_stats(spa, &dds_total);
|
|
|
|
|
|
|
|
if (dds_total.dds_blocks == 0) {
|
|
|
|
(void) printf("All DDTs are empty\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
if (dump_opt['D'] > 1) {
|
|
|
|
(void) printf("DDT histogram (aggregated over all DDTs):\n");
|
|
|
|
ddt_get_dedup_histogram(spa, &ddh_total);
|
|
|
|
zpool_dump_ddt(&dds_total, &ddh_total);
|
|
|
|
}
|
|
|
|
|
|
|
|
dump_dedup_ratio(&dds_total);
|
|
|
|
}
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to 4K, which has
certain implications, including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked, because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
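The heuristic change described in the commit message above (prefer a space map that can actually satisfy the allocation, not just the one with the most free space) can be sketched roughly as follows. Everything here is hypothetical: the `sm_summary_t` type, the power-of-two bucket layout, and both functions are illustrative assumptions, not the code introduced by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: bucket i counts free segments of size [2^i, 2^(i+1)). */
#define	HIST_BUCKETS	32

typedef struct sm_summary {
	uint64_t free_bytes;			/* total free space */
	uint64_t histogram[HIST_BUCKETS];	/* free segments per bucket */
} sm_summary_t;

/* Return 1 if some bucket guarantees a free segment of at least `size`. */
static int
can_satisfy(const sm_summary_t *sm, uint64_t size)
{
	assert(size > 0);

	for (int i = 0; i < HIST_BUCKETS; i++) {
		if (sm->histogram[i] != 0 && ((uint64_t)1 << i) >= size)
			return (1);
	}
	return (0);
}

/*
 * Among the candidates that can satisfy the allocation, pick the one
 * with the most free space. Return its index, or -1 if none qualifies.
 */
static int
pick_space_map(const sm_summary_t *maps, int n, uint64_t size)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (!can_satisfy(&maps[i], size))
			continue;
		if (best == -1 || maps[i].free_bytes > maps[best].free_bytes)
			best = i;
	}
	return (best);
}
```

In this sketch a heavily fragmented map with plenty of free bytes is skipped for a large allocation, which is the failure mode the commit message says the old free-space-only heuristic ran into.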
2013-10-02 01:25:53 +04:00
|
|
|
dump_dtl_seg(void *arg, uint64_t start, uint64_t size)
|
2009-01-16 00:59:39 +03:00
|
|
|
{
|
2013-10-02 01:25:53 +04:00
|
|
|
char *prefix = arg;
|
2009-01-16 00:59:39 +03:00
|
|
|
|
|
|
|
(void) printf("%s [%llu,%llu) length %llu\n",
|
|
|
|
prefix,
|
|
|
|
(u_longlong_t)start,
|
|
|
|
(u_longlong_t)(start + size),
|
|
|
|
(u_longlong_t)(size));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
dump_dtl(vdev_t *vd, int indent)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
spa_t *spa = vd->vdev_spa;
|
|
|
|
boolean_t required;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *name[DTL_TYPES] = { "missing", "partial", "scrub",
|
|
|
|
"outage" };
|
2009-01-16 00:59:39 +03:00
|
|
|
char prefix[256];
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_vdev_state_enter(spa, SCL_NONE);
|
2009-01-16 00:59:39 +03:00
|
|
|
required = vdev_dtl_required(vd);
|
|
|
|
(void) spa_vdev_state_exit(spa, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (indent == 0)
|
|
|
|
(void) printf("\nDirty time logs:\n\n");
|
|
|
|
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) printf("\t%*s%s [%s]\n", indent, "",
|
2008-12-03 23:09:06 +03:00
|
|
|
vd->vdev_path ? vd->vdev_path :
|
2009-01-16 00:59:39 +03:00
|
|
|
vd->vdev_parent ? vd->vdev_ops->vdev_op_type : spa_name(spa),
|
|
|
|
required ? "DTL-required" : "DTL-expendable");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int t = 0; t < DTL_TYPES; t++) {
|
2013-10-02 01:25:53 +04:00
|
|
|
range_tree_t *rt = vd->vdev_dtl[t];
|
|
|
|
if (range_tree_space(rt) == 0)
|
2009-01-16 00:59:39 +03:00
|
|
|
continue;
|
|
|
|
(void) snprintf(prefix, sizeof (prefix), "\t%*s%s",
|
|
|
|
indent + 2, "", name[t]);
|
2013-10-02 01:25:53 +04:00
|
|
|
range_tree_walk(rt, dump_dtl_seg, prefix);
|
2009-01-16 00:59:39 +03:00
|
|
|
if (dump_opt['d'] > 5 && vd->vdev_children == 0)
|
|
|
|
dump_spacemap(spa->spa_meta_objset,
|
2013-10-02 01:25:53 +04:00
|
|
|
vd->vdev_dtl_sm);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned c = 0; c < vd->vdev_children; c++)
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_dtl(vd->vdev_child[c], indent + 4);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dump_history(spa_t *spa)
|
|
|
|
{
|
|
|
|
nvlist_t **events = NULL;
|
2015-06-24 21:17:36 +03:00
|
|
|
char *buf;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t resid, len, off = 0;
|
|
|
|
uint_t num = 0;
|
|
|
|
int error;
|
|
|
|
time_t tsec;
|
|
|
|
struct tm t;
|
|
|
|
char tbuf[30];
|
|
|
|
char internalstr[MAXPATHLEN];
|
|
|
|
|
2015-06-24 21:17:36 +03:00
|
|
|
if ((buf = malloc(SPA_OLD_MAXBLOCKSIZE)) == NULL) {
|
|
|
|
(void) fprintf(stderr, "%s: unable to allocate I/O buffer\n",
|
|
|
|
__func__);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
do {
|
2015-06-24 21:17:36 +03:00
|
|
|
len = SPA_OLD_MAXBLOCKSIZE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if ((error = spa_history_get(spa, &off, &len, buf)) != 0) {
|
|
|
|
(void) fprintf(stderr, "Unable to read history: "
|
|
|
|
"error %d\n", error);
|
2015-06-24 21:17:36 +03:00
|
|
|
free(buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zpool_history_unpack(buf, len, &resid, &events, &num) != 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
off -= resid;
|
|
|
|
} while (len != 0);
|
|
|
|
|
|
|
|
(void) printf("\nHistory:\n");
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned i = 0; i < num; i++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t time, txg, ievent;
|
|
|
|
char *cmd, *intstr;
|
2013-08-28 15:45:09 +04:00
|
|
|
boolean_t printed = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (nvlist_lookup_uint64(events[i], ZPOOL_HIST_TIME,
|
|
|
|
&time) != 0)
|
2013-08-28 15:45:09 +04:00
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_lookup_string(events[i], ZPOOL_HIST_CMD,
|
|
|
|
&cmd) != 0) {
|
|
|
|
if (nvlist_lookup_uint64(events[i],
|
|
|
|
ZPOOL_HIST_INT_EVENT, &ievent) != 0)
|
2013-08-28 15:45:09 +04:00
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
verify(nvlist_lookup_uint64(events[i],
|
|
|
|
ZPOOL_HIST_TXG, &txg) == 0);
|
|
|
|
verify(nvlist_lookup_string(events[i],
|
|
|
|
ZPOOL_HIST_INT_STR, &intstr) == 0);
|
2013-08-28 15:45:09 +04:00
|
|
|
if (ievent >= ZFS_NUM_LEGACY_HISTORY_EVENTS)
|
|
|
|
goto next;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) snprintf(internalstr,
|
|
|
|
sizeof (internalstr),
|
|
|
|
"[internal %s txg:%lld] %s",
|
2010-08-26 20:52:39 +04:00
|
|
|
zfs_history_event_names[ievent],
|
|
|
|
(longlong_t)txg, intstr);
|
2010-05-29 00:45:14 +04:00
|
|
|
cmd = internalstr;
|
|
|
|
}
|
|
|
|
tsec = time;
|
|
|
|
(void) localtime_r(&tsec, &t);
|
|
|
|
(void) strftime(tbuf, sizeof (tbuf), "%F.%T", &t);
|
|
|
|
(void) printf("%s %s\n", tbuf, cmd);
|
2013-08-28 15:45:09 +04:00
|
|
|
printed = B_TRUE;
|
|
|
|
|
|
|
|
next:
|
|
|
|
if (dump_opt['h'] > 1) {
|
|
|
|
if (!printed)
|
|
|
|
(void) printf("unrecognized record:\n");
|
|
|
|
dump_nvlist(events[i], 2);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2015-06-24 21:17:36 +03:00
|
|
|
free(buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dnode(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
2014-06-25 22:37:59 +04:00
|
|
|
blkid2offset(const dnode_phys_t *dnp, const blkptr_t *bp,
|
|
|
|
const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dnp == NULL) {
|
|
|
|
ASSERT(zb->zb_level < 0);
|
|
|
|
if (zb->zb_object == 0)
|
|
|
|
return (zb->zb_blkid);
|
|
|
|
return (zb->zb_blkid * BP_GET_LSIZE(bp));
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(zb->zb_level >= 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return ((zb->zb_blkid <<
|
|
|
|
(zb->zb_level * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT))) *
|
2008-11-20 23:01:55 +03:00
|
|
|
dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
|
|
|
|
}
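The shift arithmetic in the dnode branch of blkid2offset() is easier to follow with concrete numbers. The sketch below restates it as a standalone function; `blkid_to_offset()` is a hypothetical name, and the two constants are defined locally on the assumption that block pointers are 128 bytes (shift 7) and the minimum block size is 512 bytes (shift 9).

```c
#include <assert.h>
#include <stdint.h>

#define	SPA_MINBLOCKSHIFT	9	/* assumed: 512-byte sectors */
#define	SPA_BLKPTRSHIFT		7	/* assumed: 128-byte block pointers */

/*
 * Byte offset of the first data byte covered by block `blkid` at the
 * given indirection level. Each level multiplies the span by the
 * number of block pointers per indirect block, which is
 * 2^(indblkshift - SPA_BLKPTRSHIFT).
 */
static uint64_t
blkid_to_offset(uint64_t blkid, int level, int indblkshift,
    uint64_t datablkszsec)
{
	assert(level >= 0 && indblkshift >= SPA_BLKPTRSHIFT);

	return (((blkid << (level * (indblkshift - SPA_BLKPTRSHIFT))) *
	    datablkszsec) << SPA_MINBLOCKSHIFT);
}
```

With 128K data blocks (256 sectors) and 128K indirect blocks (shift 17), level-0 block 3 starts at 3 * 128K = 393216, and level-1 block 2 covers data starting at 2 * 1024 * 128K = 268435456.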
|
|
|
|
|
|
|
|
static void
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr_compact(char *blkbuf, size_t buflen, const blkptr_t *bp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
const dva_t *dva = bp->blk_dva;
|
|
|
|
int ndvas = dump_opt['d'] > 5 ? BP_GET_NDVAS(bp) : 1;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (dump_opt['b'] >= 6) {
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, buflen, bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_EMBEDDED(bp)) {
|
|
|
|
(void) snprintf(blkbuf, buflen,
|
|
|
|
"EMBEDDED et=%u %llxL/%llxP B=%llu",
|
|
|
|
(unsigned int)BPE_GET_ETYPE(bp),
|
|
|
|
(u_longlong_t)BPE_GET_LSIZE(bp),
|
|
|
|
(u_longlong_t)BPE_GET_PSIZE(bp),
|
|
|
|
(u_longlong_t)bp->blk_birth);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
blkbuf[0] = '\0';
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < ndvas; i++)
|
2013-12-09 22:37:51 +04:00
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf), "%llu:%llx:%llx ",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)DVA_GET_VDEV(&dva[i]),
|
|
|
|
(u_longlong_t)DVA_GET_OFFSET(&dva[i]),
|
|
|
|
(u_longlong_t)DVA_GET_ASIZE(&dva[i]));
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_IS_HOLE(bp)) {
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
2015-03-27 05:03:22 +03:00
|
|
|
buflen - strlen(blkbuf),
|
|
|
|
"%llxL B=%llu",
|
|
|
|
(u_longlong_t)BP_GET_LSIZE(bp),
|
2013-12-09 22:37:51 +04:00
|
|
|
(u_longlong_t)bp->blk_birth);
|
|
|
|
} else {
|
|
|
|
(void) snprintf(blkbuf + strlen(blkbuf),
|
|
|
|
buflen - strlen(blkbuf),
|
|
|
|
"%llxL/%llxP F=%llu B=%llu/%llu",
|
|
|
|
(u_longlong_t)BP_GET_LSIZE(bp),
|
|
|
|
(u_longlong_t)BP_GET_PSIZE(bp),
|
2014-06-06 01:19:08 +04:00
|
|
|
(u_longlong_t)BP_GET_FILL(bp),
|
2013-12-09 22:37:51 +04:00
|
|
|
(u_longlong_t)bp->blk_birth,
|
|
|
|
(u_longlong_t)BP_PHYSICAL_BIRTH(bp));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
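snprintf_blkptr_compact() builds its output by repeatedly appending at `blkbuf + strlen(blkbuf)` with the remaining space as the bound. A minimal standalone sketch of that bounded-append idiom is shown here; `append_dva()` is a hypothetical helper and takes plain integers in place of a real dva_t.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append "vdev:offset:asize " to a NUL-terminated buffer without
 * writing past buflen. Output is silently truncated if it does not fit.
 */
static void
append_dva(char *buf, size_t buflen, unsigned long long vdev,
    unsigned long long offset, unsigned long long asize)
{
	size_t used = strlen(buf);

	/* The existing string, including its terminator, must fit. */
	assert(used < buflen);

	(void) snprintf(buf + used, buflen - used, "%llu:%llx:%llx ",
	    vdev, offset, asize);
}
```

Because `used < buflen` always holds, the size passed to snprintf() is at least 1, so the buffer stays NUL-terminated even when the text is truncated.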
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
static void
|
2014-06-25 22:37:59 +04:00
|
|
|
print_indirect(blkptr_t *bp, const zbookmark_phys_t *zb,
|
2008-12-03 23:09:06 +03:00
|
|
|
const dnode_phys_t *dnp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
int l;
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
ASSERT3U(BP_GET_TYPE(bp), ==, dnp->dn_type);
|
|
|
|
ASSERT3U(BP_GET_LEVEL(bp), ==, zb->zb_level);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%16llx ", (u_longlong_t)blkid2offset(dnp, bp, zb));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(zb->zb_level >= 0);
|
|
|
|
|
|
|
|
for (l = dnp->dn_nlevels - 1; l >= -1; l--) {
|
|
|
|
if (l == zb->zb_level) {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("L%llx", (u_longlong_t)zb->zb_level);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf(" ");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), bp);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("%s\n", blkbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
visit_indirect(spa_t *spa, const dnode_phys_t *dnp,
|
2014-06-25 22:37:59 +04:00
|
|
|
blkptr_t *bp, const zbookmark_phys_t *zb)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
int err = 0;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
if (bp->blk_birth == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
print_indirect(bp, zb, dnp);
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_GET_LEVEL(bp) > 0 && !BP_IS_HOLE(bp)) {
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_flags_t flags = ARC_FLAG_WAIT;
|
2008-12-03 23:09:06 +03:00
|
|
|
int i;
|
|
|
|
blkptr_t *cbp;
|
|
|
|
int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
|
|
|
|
arc_buf_t *buf;
|
|
|
|
uint64_t fill = 0;
|
|
|
|
|
2013-07-03 00:26:24 +04:00
|
|
|
err = arc_read(NULL, spa, bp, arc_getbuf_func, &buf,
|
2008-12-03 23:09:06 +03:00
|
|
|
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
|
|
|
|
if (err)
|
|
|
|
return (err);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(buf->b_data);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* recursively visit blocks below this */
|
|
|
|
cbp = buf->b_data;
|
|
|
|
for (i = 0; i < epb; i++, cbp++) {
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t czb;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
|
|
|
|
zb->zb_level - 1,
|
|
|
|
zb->zb_blkid * epb + i);
|
|
|
|
err = visit_indirect(spa, dnp, cbp, &czb);
|
|
|
|
if (err)
|
|
|
|
break;
|
2014-06-06 01:19:08 +04:00
|
|
|
fill += BP_GET_FILL(cbp);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!err)
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(fill, ==, BP_GET_FILL(bp));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(buf, &buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
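When visit_indirect() recurses, it derives each child's block id from the parent's block id and the number of block pointers per indirect block (epb). The sketch below isolates that calculation; `child_blkid()` is a hypothetical helper, and the 128-byte block pointer size behind SPA_BLKPTRSHIFT is an assumption stated in the code.

```c
#include <assert.h>
#include <stdint.h>

#define	SPA_BLKPTRSHIFT	7	/* assumed: 128-byte block pointers */

/*
 * Block id, one level down, of the i-th pointer inside an indirect
 * block whose logical size is `lsize` bytes.
 */
static uint64_t
child_blkid(uint64_t parent_blkid, uint64_t lsize, int i)
{
	uint64_t epb = lsize >> SPA_BLKPTRSHIFT;	/* entries per block */

	assert(i >= 0 && (uint64_t)i < epb);

	return (parent_blkid * epb + (uint64_t)i);
}
```

For a 128K indirect block, epb is 1024, so entry 5 of parent block 2 is child block 2 * 1024 + 5 = 2053 at the next lower level.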
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_indirect(dnode_t *dn)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
dnode_phys_t *dnp = dn->dn_phys;
|
|
|
|
int j;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t czb;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("Indirect blocks:\n");
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
SET_BOOKMARK(&czb, dmu_objset_id(dn->dn_objset),
|
2008-12-03 23:09:06 +03:00
|
|
|
dn->dn_object, dnp->dn_nlevels - 1, 0);
|
|
|
|
for (j = 0; j < dnp->dn_nblkptr; j++) {
|
|
|
|
czb.zb_blkid = j;
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) visit_indirect(dmu_objset_spa(dn->dn_objset), dnp,
|
2008-12-03 23:09:06 +03:00
|
|
|
&dnp->dn_blkptr[j], &czb);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dsl_dir(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dsl_dir_phys_t *dd = data;
|
|
|
|
time_t crtime;
|
2010-05-29 00:45:14 +04:00
|
|
|
char nice[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (nice) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dd == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT3U(size, >=, sizeof (dsl_dir_phys_t));
|
|
|
|
|
|
|
|
crtime = dd->dd_creation_time;
|
|
|
|
(void) printf("\t\tcreation_time = %s", ctime(&crtime));
|
|
|
|
(void) printf("\t\thead_dataset_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_head_dataset_obj);
|
|
|
|
(void) printf("\t\tparent_dir_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_parent_obj);
|
|
|
|
(void) printf("\t\torigin_obj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_origin_obj);
|
|
|
|
(void) printf("\t\tchild_dir_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_child_dir_zapobj);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_used_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tused_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_compressed_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tcompressed_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_uncompressed_bytes, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tuncompressed_bytes = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_quota, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tquota = %s\n", nice);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_reserved, nice, sizeof (nice));
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\treserved = %s\n", nice);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tprops_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_props_zapobj);
|
|
|
|
(void) printf("\t\tdeleg_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)dd->dd_deleg_zapobj);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tflags = %llx\n",
|
|
|
|
(u_longlong_t)dd->dd_flags);
|
|
|
|
|
|
|
|
#define DO(which) \
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dd->dd_used_breakdown[DD_USED_ ## which], nice, \
|
|
|
|
sizeof (nice)); \
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tused_breakdown[" #which "] = %s\n", nice)
|
|
|
|
DO(HEAD);
|
|
|
|
DO(SNAP);
|
|
|
|
DO(CHILD);
|
|
|
|
DO(CHILD_RSRV);
|
|
|
|
DO(REFRSRV);
|
|
|
|
#undef DO
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dsl_dataset(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
dsl_dataset_phys_t *ds = data;
|
|
|
|
time_t crtime;
|
2010-05-29 00:45:14 +04:00
|
|
|
char used[32], compressed[32], uncompressed[32], unique[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (used) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (compressed) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncompressed) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (unique) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ds == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(size == sizeof (*ds));
|
|
|
|
crtime = ds->ds_creation_time;
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(ds->ds_referenced_bytes, used, sizeof (used));
|
|
|
|
zdb_nicenum(ds->ds_compressed_bytes, compressed, sizeof (compressed));
|
|
|
|
zdb_nicenum(ds->ds_uncompressed_bytes, uncompressed,
|
|
|
|
sizeof (uncompressed));
|
|
|
|
zdb_nicenum(ds->ds_unique_bytes, unique, sizeof (unique));
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &ds->ds_bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tdir_obj = %llu\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
(u_longlong_t)ds->ds_dir_obj);
|
|
|
|
(void) printf("\t\tprev_snap_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_prev_snap_obj);
|
|
|
|
(void) printf("\t\tprev_snap_txg = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_prev_snap_txg);
|
|
|
|
(void) printf("\t\tnext_snap_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_next_snap_obj);
|
|
|
|
(void) printf("\t\tsnapnames_zapobj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_snapnames_zapobj);
|
|
|
|
(void) printf("\t\tnum_children = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_num_children);
|
2009-08-18 22:43:27 +04:00
|
|
|
(void) printf("\t\tuserrefs_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_userrefs_obj);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tcreation_time = %s", ctime(&crtime));
|
|
|
|
(void) printf("\t\tcreation_txg = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_creation_txg);
|
|
|
|
(void) printf("\t\tdeadlist_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_deadlist_obj);
|
|
|
|
(void) printf("\t\tused_bytes = %s\n", used);
|
|
|
|
(void) printf("\t\tcompressed_bytes = %s\n", compressed);
|
|
|
|
(void) printf("\t\tuncompressed_bytes = %s\n", uncompressed);
|
|
|
|
(void) printf("\t\tunique = %s\n", unique);
|
|
|
|
(void) printf("\t\tfsid_guid = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_fsid_guid);
|
|
|
|
(void) printf("\t\tguid = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_guid);
|
|
|
|
(void) printf("\t\tflags = %llx\n",
|
|
|
|
(u_longlong_t)ds->ds_flags);
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) printf("\t\tnext_clones_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_next_clones_obj);
|
|
|
|
(void) printf("\t\tprops_obj = %llu\n",
|
|
|
|
(u_longlong_t)ds->ds_props_obj);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tbp = %s\n", blkbuf);
|
|
|
|
}
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
dump_bptree_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
if (bp->blk_birth != 0) {
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2012-12-14 03:24:15 +04:00
|
|
|
(void) printf("\t%s\n", blkbuf);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2017-10-27 22:46:35 +03:00
|
|
|
dump_bptree(objset_t *os, uint64_t obj, const char *name)
|
2012-12-14 03:24:15 +04:00
|
|
|
{
|
|
|
|
char bytes[32];
|
|
|
|
bptree_phys_t *bt;
|
|
|
|
dmu_buf_t *db;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
|
|
|
|
|
|
|
VERIFY3U(0, ==, dmu_bonus_hold(os, obj, FTAG, &db));
|
|
|
|
bt = db->db_data;
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bt->bt_bytes, bytes, sizeof (bytes));
|
2012-12-14 03:24:15 +04:00
|
|
|
(void) printf("\n %s: %llu datasets, %s\n",
|
|
|
|
name, (unsigned long long)(bt->bt_end - bt->bt_begin), bytes);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
if (dump_opt['d'] < 5)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
(void) bptree_iterate(os, obj, B_FALSE, dump_bptree_cb, NULL, NULL);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
dump_bpobj_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
ASSERT(bp->blk_birth != 0);
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr_compact(blkbuf, sizeof (blkbuf), bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\t%s\n", blkbuf);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2017-10-27 22:46:35 +03:00
|
|
|
dump_full_bpobj(bpobj_t *bpo, const char *name, int indent)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char bytes[32];
|
|
|
|
char comp[32];
|
|
|
|
char uncomp[32];
|
2013-07-05 23:37:16 +04:00
|
|
|
uint64_t i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_bytes, bytes, sizeof (bytes));
|
2013-07-05 23:37:16 +04:00
|
|
|
if (bpo->bpo_havesubobj && bpo->bpo_phys->bpo_subobjs != 0) {
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_comp, comp, sizeof (comp));
|
|
|
|
zdb_nicenum(bpo->bpo_phys->bpo_uncomp, uncomp, sizeof (uncomp));
|
2013-07-05 23:37:16 +04:00
|
|
|
(void) printf(" %*s: object %llu, %llu local blkptrs, "
|
2015-04-27 01:27:36 +03:00
|
|
|
"%llu subobjs in object, %llu, %s (%s/%s comp)\n",
|
2013-07-05 23:37:16 +04:00
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_subobjs,
|
2015-04-27 01:27:36 +03:00
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_subobjs,
|
2008-11-20 23:01:55 +03:00
|
|
|
bytes, comp, uncomp);
|
2013-07-05 23:37:16 +04:00
|
|
|
|
|
|
|
for (i = 0; i < bpo->bpo_phys->bpo_num_subobjs; i++) {
|
|
|
|
uint64_t subobj;
|
|
|
|
bpobj_t subbpo;
|
|
|
|
int error;
|
|
|
|
VERIFY0(dmu_read(bpo->bpo_os,
|
|
|
|
bpo->bpo_phys->bpo_subobjs,
|
|
|
|
i * sizeof (subobj), sizeof (subobj), &subobj, 0));
|
|
|
|
error = bpobj_open(&subbpo, bpo->bpo_os, subobj);
|
|
|
|
if (error != 0) {
|
|
|
|
(void) printf("ERROR %d while trying to open "
|
|
|
|
"subobj id %llu\n",
|
|
|
|
error, (u_longlong_t)subobj);
|
|
|
|
continue;
|
|
|
|
}
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&subbpo, "subobj", indent + 1);
|
2015-12-31 18:57:11 +03:00
|
|
|
bpobj_close(&subbpo);
|
2013-07-05 23:37:16 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2013-07-05 23:37:16 +04:00
|
|
|
(void) printf(" %*s: object %llu, %llu blkptrs, %s\n",
|
|
|
|
indent * 8, name,
|
|
|
|
(u_longlong_t)bpo->bpo_object,
|
|
|
|
(u_longlong_t)bpo->bpo_phys->bpo_num_blkptrs,
|
|
|
|
bytes);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['d'] < 5)
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
|
2013-07-05 23:37:16 +04:00
|
|
|
if (indent == 0) {
|
|
|
|
(void) bpobj_iterate_nofree(bpo, dump_bpobj_cb, NULL, NULL);
|
|
|
|
(void) printf("\n");
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
dump_deadlist(dsl_deadlist_t *dl)
|
|
|
|
{
|
|
|
|
dsl_deadlist_entry_t *dle;
|
2013-07-05 23:37:16 +04:00
|
|
|
uint64_t unused;
|
2010-05-29 00:45:14 +04:00
|
|
|
char bytes[32];
|
|
|
|
char comp[32];
|
|
|
|
char uncomp[32];
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (bytes) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (comp) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (uncomp) >= NN_NUMBUF_SZ);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['d'] < 3)
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-09-17 11:14:39 +04:00
|
|
|
if (dl->dl_oldfmt) {
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&dl->dl_bpobj, "old-format deadlist", 0);
|
2014-09-17 11:14:39 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(dl->dl_phys->dl_used, bytes, sizeof (bytes));
|
|
|
|
zdb_nicenum(dl->dl_phys->dl_comp, comp, sizeof (comp));
|
|
|
|
zdb_nicenum(dl->dl_phys->dl_uncomp, uncomp, sizeof (uncomp));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\n Deadlist: %s (%s/%s comp)\n",
|
|
|
|
bytes, comp, uncomp);
|
|
|
|
|
|
|
|
if (dump_opt['d'] < 4)
|
|
|
|
return;
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2013-07-05 23:37:16 +04:00
|
|
|
/* force the tree to be loaded */
|
|
|
|
dsl_deadlist_space_range(dl, 0, UINT64_MAX, &unused, &unused, &unused);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
for (dle = avl_first(&dl->dl_tree); dle;
|
|
|
|
dle = AVL_NEXT(&dl->dl_tree, dle)) {
|
2013-07-05 23:37:16 +04:00
|
|
|
if (dump_opt['d'] >= 5) {
|
|
|
|
char buf[128];
|
2013-11-01 23:26:11 +04:00
|
|
|
(void) snprintf(buf, sizeof (buf),
|
|
|
|
"mintxg %llu -> obj %llu",
|
2013-07-05 23:37:16 +04:00
|
|
|
(u_longlong_t)dle->dle_mintxg,
|
|
|
|
(u_longlong_t)dle->dle_bpobj.bpo_object);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&dle->dle_bpobj, buf, 0);
|
2013-07-05 23:37:16 +04:00
|
|
|
} else {
|
|
|
|
(void) printf("mintxg %llu -> obj %llu\n",
|
|
|
|
(u_longlong_t)dle->dle_mintxg,
|
|
|
|
(u_longlong_t)dle->dle_bpobj.bpo_object);
|
|
|
|
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static avl_tree_t idx_tree;
|
|
|
|
static avl_tree_t domain_tree;
|
|
|
|
static boolean_t fuid_table_loaded;
|
2017-04-13 19:40:56 +03:00
|
|
|
static objset_t *sa_os = NULL;
|
|
|
|
static sa_attr_type_t *sa_attr_table = NULL;
|
|
|
|
|
|
|
|
static int
|
|
|
|
open_objset(const char *path, dmu_objset_type_t type, void *tag, objset_t **osp)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
uint64_t sa_attrs = 0;
|
|
|
|
uint64_t version = 0;
|
|
|
|
|
|
|
|
VERIFY3P(sa_os, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dmu_objset_own(path, type, B_TRUE, B_FALSE, tag, osp);
|
2017-04-13 19:40:56 +03:00
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr, "failed to own dataset '%s': %s\n", path,
|
|
|
|
strerror(err));
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2017-08-14 20:36:48 +03:00
|
|
|
if (dmu_objset_type(*osp) == DMU_OST_ZFS && !(*osp)->os_encrypted) {
|
2017-04-13 19:40:56 +03:00
|
|
|
(void) zap_lookup(*osp, MASTER_NODE_OBJ, ZPL_VERSION_STR,
|
|
|
|
8, 1, &version);
|
|
|
|
if (version >= ZPL_VERSION_SA) {
|
|
|
|
(void) zap_lookup(*osp, MASTER_NODE_OBJ, ZFS_SA_ATTRS,
|
|
|
|
8, 1, &sa_attrs);
|
|
|
|
}
|
|
|
|
err = sa_setup(*osp, sa_attrs, zfs_attr_table, ZPL_END,
|
|
|
|
&sa_attr_table);
|
|
|
|
if (err != 0) {
|
|
|
|
(void) fprintf(stderr, "sa_setup failed: %s\n",
|
|
|
|
strerror(err));
|
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(*osp, B_FALSE, tag);
|
2017-04-13 19:40:56 +03:00
|
|
|
*osp = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
sa_os = *osp;
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
close_objset(objset_t *os, void *tag)
|
|
|
|
{
|
|
|
|
VERIFY3P(os, ==, sa_os);
|
|
|
|
if (os->os_sa != NULL)
|
|
|
|
sa_tear_down(os);
|
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_FALSE, tag);
|
2017-04-13 19:40:56 +03:00
|
|
|
sa_attr_table = NULL;
|
|
|
|
sa_os = NULL;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
2010-08-26 20:52:41 +04:00
|
|
|
fuid_table_destroy(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
if (fuid_table_loaded) {
|
|
|
|
zfs_fuid_table_destroy(&idx_tree, &domain_tree);
|
|
|
|
fuid_table_loaded = B_FALSE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Print uid or gid information.
|
|
|
|
* For normal POSIX id just the id is printed in decimal format.
|
|
|
|
* For CIFS files with FUID the fuid is printed in hex followed by
|
2013-07-05 23:37:16 +04:00
|
|
|
* the domain-rid string.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
print_idstr(uint64_t id, const char *id_type)
|
|
|
|
{
|
|
|
|
if (FUID_INDEX(id)) {
|
|
|
|
char *domain;
|
|
|
|
|
|
|
|
domain = zfs_fuid_idx_domain(&idx_tree, FUID_INDEX(id));
|
|
|
|
(void) printf("\t%s %llx [%s-%d]\n", id_type,
|
|
|
|
(u_longlong_t)id, domain, (int)FUID_RID(id));
|
|
|
|
} else {
|
|
|
|
(void) printf("\t%s %llu\n", id_type, (u_longlong_t)id);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
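
The two output shapes described in the comment above print_idstr() can be sketched without the domain table. This is an illustration under stated assumptions, not zdb code: it assumes a FUID keeps the domain index in its upper 32 bits and the RID in its lower 32 bits, takes the domain string as a parameter instead of looking it up, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: upper 32 bits hold the domain index. */
static uint32_t
sketch_fuid_index(uint64_t id)
{
	return ((uint32_t)(id >> 32));
}

/* Assumed layout: lower 32 bits hold the relative id (RID). */
static uint32_t
sketch_fuid_rid(uint64_t id)
{
	return ((uint32_t)(id & 0xffffffffULL));
}

/*
 * Format an id into buf: a plain POSIX id is written in decimal, while
 * an id with a nonzero domain index is written in hex followed by
 * "[domain-rid]". Returns what snprintf() returns.
 */
static int
sketch_format_id(char *buf, size_t len, uint64_t id, const char *domain)
{
	assert(buf != NULL && len > 0);
	if (sketch_fuid_index(id) != 0) {
		return (snprintf(buf, len, "%llx [%s-%d]",
		    (unsigned long long)id, domain,
		    (int)sketch_fuid_rid(id)));
	}
	return (snprintf(buf, len, "%llu", (unsigned long long)id));
}
```

Taking the domain as an argument keeps the sketch free of the AVL trees that zdb loads lazily in dump_uidgid(); a real caller would still need that lookup.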
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uidgid(objset_t *os, uint64_t uid, uint64_t gid)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint32_t uid_idx, gid_idx;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
uid_idx = FUID_INDEX(uid);
|
|
|
|
gid_idx = FUID_INDEX(gid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* Load domain table, if not already loaded */
|
|
|
|
if (!fuid_table_loaded && (uid_idx || gid_idx)) {
|
|
|
|
uint64_t fuid_obj;
|
|
|
|
|
|
|
|
/* first find the fuid object. It lives in the master node */
|
|
|
|
VERIFY(zap_lookup(os, MASTER_NODE_OBJ, ZFS_FUID_TABLES,
|
|
|
|
8, 1, &fuid_obj) == 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
zfs_fuid_avl_tree_create(&idx_tree, &domain_tree);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) zfs_fuid_table_load(os, fuid_obj,
|
|
|
|
&idx_tree, &domain_tree);
|
|
|
|
fuid_table_loaded = B_TRUE;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
print_idstr(uid, "uid");
|
|
|
|
print_idstr(gid, "gid");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-07-09 16:15:26 +04:00
|
|
|
static void
|
|
|
|
dump_znode_sa_xattr(sa_handle_t *hdl)
|
|
|
|
{
|
|
|
|
nvlist_t *sa_xattr;
|
|
|
|
nvpair_t *elem = NULL;
|
|
|
|
int sa_xattr_size = 0;
|
|
|
|
int sa_xattr_entries = 0;
|
|
|
|
int error;
|
|
|
|
char *sa_xattr_packed;
|
|
|
|
|
|
|
|
error = sa_size(hdl, sa_attr_table[ZPL_DXATTR], &sa_xattr_size);
|
|
|
|
if (error || sa_xattr_size == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
sa_xattr_packed = malloc(sa_xattr_size);
|
|
|
|
if (sa_xattr_packed == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
error = sa_lookup(hdl, sa_attr_table[ZPL_DXATTR],
|
|
|
|
sa_xattr_packed, sa_xattr_size);
|
|
|
|
if (error) {
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = nvlist_unpack(sa_xattr_packed, sa_xattr_size, &sa_xattr, 0);
|
|
|
|
if (error) {
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
while ((elem = nvlist_next_nvpair(sa_xattr, elem)) != NULL)
|
|
|
|
sa_xattr_entries++;
|
|
|
|
|
|
|
|
(void) printf("\tSA xattrs: %d bytes, %d entries\n\n",
|
|
|
|
sa_xattr_size, sa_xattr_entries);
|
|
|
|
while ((elem = nvlist_next_nvpair(sa_xattr, elem)) != NULL) {
|
|
|
|
uchar_t *value;
|
|
|
|
uint_t cnt, idx;
|
|
|
|
|
|
|
|
(void) printf("\t\t%s = ", nvpair_name(elem));
|
|
|
|
nvpair_value_byte_array(elem, &value, &cnt);
|
2013-11-01 23:26:11 +04:00
|
|
|
for (idx = 0; idx < cnt; ++idx) {
|
2013-07-09 16:15:26 +04:00
|
|
|
if (isprint(value[idx]))
|
|
|
|
(void) putchar(value[idx]);
|
|
|
|
else
|
|
|
|
(void) printf("\\%3.3o", value[idx]);
|
|
|
|
}
|
|
|
|
(void) putchar('\n');
|
|
|
|
}
|
|
|
|
|
|
|
|
nvlist_free(sa_xattr);
|
|
|
|
free(sa_xattr_packed);
|
|
|
|
}
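
The value-printing loop in dump_znode_sa_xattr() emits printable bytes as-is and everything else as a three-digit octal escape. A minimal sketch of that escaping, writing into a caller-supplied buffer so it can be checked, might look as follows; sketch_escape_bytes() is a hypothetical helper and does not exist in zdb.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Copy cnt bytes from value into out, replacing each non-printable
 * byte with a backslash and three octal digits. Stops early if the
 * next byte might not fit. Always NUL-terminates; returns the number
 * of characters written, not counting the terminator.
 */
static size_t
sketch_escape_bytes(const unsigned char *value, size_t cnt,
    char *out, size_t outlen)
{
	size_t used = 0;

	assert(out != NULL && outlen > 0);
	for (size_t idx = 0; idx < cnt; idx++) {
		/* Worst case per byte: four characters plus the NUL. */
		if (used + 4 >= outlen)
			break;
		if (isprint(value[idx])) {
			out[used++] = (char)value[idx];
		} else {
			used += (size_t)snprintf(out + used, outlen - used,
			    "\\%3.3o", (unsigned int)value[idx]);
		}
	}
	out[used] = '\0';
	return (used);
}
```

For the three input bytes 'a', 0x01, 'b' this produces the six characters a\001b, which matches what the loop in dump_znode_sa_xattr() would print for the same value in the default C locale.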
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_znode(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
char path[MAXPATHLEN * 2]; /* allow for xattr and failure prefix */
|
2010-05-29 00:45:14 +04:00
|
|
|
sa_handle_t *hdl;
|
|
|
|
uint64_t xattr, rdev, gen;
|
|
|
|
uint64_t uid, gid, mode, fsize, parent, links;
|
|
|
|
uint64_t pflags;
|
|
|
|
uint64_t acctm[2], modtm[2], chgtm[2], crtm[2];
|
|
|
|
time_t z_crtime, z_atime, z_mtime, z_ctime;
|
|
|
|
sa_bulk_attr_t bulk[12];
|
|
|
|
int idx = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error;
|
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
VERIFY3P(os, ==, sa_os);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (sa_handle_get(os, object, NULL, SA_HDL_PRIVATE, &hdl)) {
|
|
|
|
(void) printf("Failed to get handle for SA znode\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_UID], NULL, &uid, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_GID], NULL, &gid, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_LINKS], NULL,
|
|
|
|
&links, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_GEN], NULL, &gen, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_MODE], NULL,
|
|
|
|
&mode, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_PARENT],
|
|
|
|
NULL, &parent, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_SIZE], NULL,
|
|
|
|
&fsize, 8);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_ATIME], NULL,
|
|
|
|
acctm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_MTIME], NULL,
|
|
|
|
modtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_CRTIME], NULL,
|
|
|
|
crtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_CTIME], NULL,
|
|
|
|
chgtm, 16);
|
|
|
|
SA_ADD_BULK_ATTR(bulk, idx, sa_attr_table[ZPL_FLAGS], NULL,
|
|
|
|
&pflags, 8);
|
|
|
|
|
|
|
|
if (sa_bulk_lookup(hdl, bulk, idx)) {
|
|
|
|
(void) sa_handle_destroy(hdl);
|
|
|
|
return;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
z_crtime = (time_t)crtm[0];
|
|
|
|
z_atime = (time_t)acctm[0];
|
|
|
|
z_mtime = (time_t)modtm[0];
|
|
|
|
z_ctime = (time_t)chgtm[0];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-14 00:22:32 +03:00
|
|
|
if (dump_opt['d'] > 4) {
|
|
|
|
error = zfs_obj_to_path(os, object, path, sizeof (path));
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there are few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and try to figure out the path of the object by following its parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, its parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken its place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change its parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
if (error == ESTALE) {
|
|
|
|
(void) snprintf(path, sizeof (path), "on delete queue");
|
|
|
|
} else if (error != 0) {
|
|
|
|
leaked_objects++;
|
2017-04-14 00:22:32 +03:00
|
|
|
(void) snprintf(path, sizeof (path),
|
2017-07-06 20:35:20 +03:00
|
|
|
"path not found, possibly leaked");
|
2017-04-14 00:22:32 +03:00
|
|
|
}
|
|
|
|
(void) printf("\tpath %s\n", path);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uidgid(os, uid, gid);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\tatime %s", ctime(&z_atime));
|
|
|
|
(void) printf("\tmtime %s", ctime(&z_mtime));
|
|
|
|
(void) printf("\tctime %s", ctime(&z_ctime));
|
|
|
|
(void) printf("\tcrtime %s", ctime(&z_crtime));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("\tgen %llu\n", (u_longlong_t)gen);
|
|
|
|
(void) printf("\tmode %llo\n", (u_longlong_t)mode);
|
|
|
|
(void) printf("\tsize %llu\n", (u_longlong_t)fsize);
|
|
|
|
(void) printf("\tparent %llu\n", (u_longlong_t)parent);
|
|
|
|
(void) printf("\tlinks %llu\n", (u_longlong_t)links);
|
|
|
|
(void) printf("\tpflags %llx\n", (u_longlong_t)pflags);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os) && (pflags & ZFS_PROJID)) {
|
|
|
|
uint64_t projid;
|
|
|
|
|
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_PROJID], &projid,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\tprojid %llu\n", (u_longlong_t)projid);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_XATTR], &xattr,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\txattr %llu\n", (u_longlong_t)xattr);
|
|
|
|
if (sa_lookup(hdl, sa_attr_table[ZPL_RDEV], &rdev,
|
|
|
|
sizeof (uint64_t)) == 0)
|
|
|
|
(void) printf("\trdev 0x%016llx\n", (u_longlong_t)rdev);
|
2013-07-09 16:15:26 +04:00
|
|
|
dump_znode_sa_xattr(hdl);
|
2010-05-29 00:45:14 +04:00
|
|
|
sa_handle_destroy(hdl);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_acl(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
/*ARGSUSED*/
|
|
|
|
static void
|
|
|
|
dump_dmu_objset(objset_t *os, uint64_t object, void *data, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static object_viewer_t *object_viewer[DMU_OT_NUMTYPES + 1] = {
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_none, /* unallocated */
|
|
|
|
dump_zap, /* object directory */
|
|
|
|
dump_uint64, /* object array */
|
|
|
|
dump_none, /* packed nvlist */
|
|
|
|
dump_packed_nvlist, /* packed nvlist size */
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_none, /* bpobj */
|
|
|
|
dump_bpobj, /* bpobj header */
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_none, /* SPA space map header */
|
|
|
|
dump_none, /* SPA space map */
|
|
|
|
dump_none, /* ZIL intent log */
|
|
|
|
dump_dnode, /* DMU dnode */
|
|
|
|
dump_dmu_objset, /* DMU objset */
|
|
|
|
dump_dsl_dir, /* DSL directory */
|
|
|
|
dump_zap, /* DSL directory child map */
|
|
|
|
dump_zap, /* DSL dataset snap map */
|
|
|
|
dump_zap, /* DSL props */
|
|
|
|
dump_dsl_dataset, /* DSL dataset */
|
|
|
|
dump_znode, /* ZFS znode */
|
|
|
|
dump_acl, /* ZFS V0 ACL */
|
|
|
|
dump_uint8, /* ZFS plain file */
|
|
|
|
dump_zpldir, /* ZFS directory */
|
|
|
|
dump_zap, /* ZFS master node */
|
|
|
|
dump_zap, /* ZFS delete queue */
|
|
|
|
dump_uint8, /* zvol object */
|
|
|
|
dump_zap, /* zvol prop */
|
|
|
|
dump_uint8, /* other uint8[] */
|
|
|
|
dump_uint64, /* other uint64[] */
|
|
|
|
dump_zap, /* other ZAP */
|
|
|
|
dump_zap, /* persistent error log */
|
|
|
|
dump_uint8, /* SPA history */
|
2013-08-28 15:45:09 +04:00
|
|
|
dump_history_offsets, /* SPA history offsets */
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_zap, /* Pool properties */
|
|
|
|
dump_zap, /* DSL permissions */
|
|
|
|
dump_acl, /* ZFS ACL */
|
|
|
|
dump_uint8, /* ZFS SYSACL */
|
|
|
|
dump_none, /* FUID nvlist */
|
|
|
|
dump_packed_nvlist, /* FUID nvlist size */
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_zap, /* DSL dataset next clones */
|
|
|
|
dump_zap, /* DSL scrub queue */
|
2018-02-14 01:54:54 +03:00
|
|
|
dump_zap, /* ZFS user/group/project used */
|
|
|
|
dump_zap, /* ZFS user/group/project quota */
|
2009-08-18 22:43:27 +04:00
|
|
|
dump_zap, /* snapshot refcount tags */
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_ddt_zap, /* DDT ZAP object */
|
|
|
|
dump_zap, /* DDT statistics */
|
|
|
|
dump_znode, /* SA object */
|
|
|
|
dump_zap, /* SA Master Node */
|
|
|
|
dump_sa_attrs, /* SA attribute registration */
|
|
|
|
dump_sa_layouts, /* SA attribute layouts */
|
|
|
|
dump_zap, /* DSL scrub translations */
|
|
|
|
dump_none, /* fake dedup BP */
|
|
|
|
dump_zap, /* deadlist */
|
|
|
|
dump_none, /* deadlist hdr */
|
|
|
|
dump_zap, /* dsl clones */
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_bpobj_subobjs, /* bpobj subobjs */
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_unknown, /* Unknown type, must be last */
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
static void
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(objset_t *os, uint64_t object, int verbosity, int *print_header,
|
|
|
|
uint64_t *dnode_slots_used)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dmu_buf_t *db = NULL;
|
|
|
|
dmu_object_info_t doi;
|
|
|
|
dnode_t *dn;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t dnode_held = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
void *bonus = NULL;
|
|
|
|
size_t bsize = 0;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
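The worst-case arithmetic above can be sketched as a small helper. This is
only an illustration: sketch_worst_case_reads() and its parameters are
hypothetical names that do not exist in ZFS, and the model assumes one read
per dnode block plus one read per spill block.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of ZFS: count worst-case reads for
 * `ndnodes` dnodes of `dnode_bytes` each, packed into dnode blocks of
 * `block_bytes`, where each dnode needs one spill block if `needs_spill`
 * is nonzero.
 */
static size_t
sketch_worst_case_reads(size_t ndnodes, size_t dnode_bytes,
    size_t block_bytes, int needs_spill)
{
	size_t total = ndnodes * dnode_bytes;
	/* Round up to whole dnode blocks. */
	size_t dnode_blocks = (total + block_bytes - 1) / block_bytes;

	/* One read per dnode block, plus one per spill block if any. */
	return (dnode_blocks + (needs_spill ? ndnodes : 0));
}
```

With 32 legacy 512-byte dnodes that all spill, the helper reproduces the 33
reads quoted above; with 1024-byte dnodes and no spill blocks it yields two.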
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
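The relationship between the two conventions can be sketched as follows.
The names prefixed with sketch_ are hypothetical and are not the ZFS
definitions; only the 512-byte slot size, the 16384-byte block size and the
"extra + 1" rule are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; the values come from the description above. */
#define	SKETCH_SLOT_BYTES	512u
#define	SKETCH_BLOCK_BYTES	16384u

/* On-disk convention: slots beyond the first (0 for a 512-byte dnode). */
static inline uint8_t
sketch_extra_slots_from_size(size_t dnode_bytes)
{
	assert(dnode_bytes >= SKETCH_SLOT_BYTES);
	assert(dnode_bytes <= SKETCH_BLOCK_BYTES);
	assert(dnode_bytes % SKETCH_SLOT_BYTES == 0);
	return ((uint8_t)(dnode_bytes / SKETCH_SLOT_BYTES - 1));
}

/* In-memory convention: total slots consumed, always extra + 1. */
static inline unsigned
sketch_num_slots(uint8_t extra_slots)
{
	return ((unsigned)extra_slots + 1);
}

/* Convert a total slot count back to a size in bytes. */
static inline size_t
sketch_size_from_num_slots(unsigned num_slots)
{
	return ((size_t)num_slots * SKETCH_SLOT_BYTES);
}
```

A 512-byte dnode therefore has 0 extra slots on disk and 1 slot in memory,
while a 16384-byte dnode has 31 extra slots on disk and 32 slots in memory.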
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
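The slot-skipping traversal described in the last item can be sketched
roughly as below. sketch_slot_t and sketch_count_dnodes() are hypothetical
stand-ins, not the real dnode_phys_t or any ZFS function; the point is only
that the walker advances by the current dnode's slot count rather than by
one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the header of one 512-byte slot. */
typedef struct sketch_slot {
	uint8_t allocated;	/* nonzero if a dnode starts in this slot */
	uint8_t extra_slots;	/* slots consumed beyond this one */
} sketch_slot_t;

/*
 * Count the dnodes in a block of `nslots` slots. Interior slots of a
 * large dnode hold bonus data and may look like anything, so they are
 * skipped instead of being inspected.
 */
static size_t
sketch_count_dnodes(const sketch_slot_t *slots, size_t nslots)
{
	size_t count = 0;
	size_t i = 0;

	while (i < nslots) {
		if (!slots[i].allocated) {
			i++;		/* hole: move to the next slot */
			continue;
		}
		count++;
		/* Skip this dnode and all of its extra slots. */
		i += (size_t)slots[i].extra_slots + 1;
	}
	return (count);
}
```

A plain array walk over the same block would misread the interior slots of
large dnodes as dnodes of their own, which is the failure the change
guards against.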
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
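As a rough sketch of this packing, assuming the slot value is stored as
given in the top 8 bits and the object id occupies the low 48 bits: the
sketch_ names below are hypothetical and are not the macros ZFS uses, and
the real encoding of the slot value may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: object id in bits 0..47, slot count in bits 56..63. */
#define	SKETCH_OBJ_BITS		48
#define	SKETCH_SLOT_SHIFT	56
#define	SKETCH_OBJ_MASK		((UINT64_C(1) << SKETCH_OBJ_BITS) - 1)

/* Combine an object id and a slot value into one 64-bit field. */
static inline uint64_t
sketch_foid_pack(uint64_t object, uint8_t slots)
{
	assert((object & ~SKETCH_OBJ_MASK) == 0);
	return (object | ((uint64_t)slots << SKETCH_SLOT_SHIFT));
}

/* Recover the object id, ignoring the upper bits. */
static inline uint64_t
sketch_foid_object(uint64_t foid)
{
	return (foid & SKETCH_OBJ_MASK);
}

/* Recover the slot value from the uppermost 8 bits. */
static inline uint8_t
sketch_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_SLOT_SHIFT));
}
```

Because a slot value of 0 leaves the field equal to the bare object id,
records written without a slot count keep the same bit pattern under this
layout.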
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
char iblk[32], dblk[32], lsize[32], asize[32], fill[32], dnsize[32];
|
2010-05-29 00:45:14 +04:00
|
|
|
char bonus_size[32];
|
2008-11-20 23:01:55 +03:00
|
|
|
char aux[50];
|
|
|
|
int error;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (iblk) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (dblk) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (lsize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (asize) >= NN_NUMBUF_SZ);
|
|
|
|
CTASSERT(sizeof (bonus_size) >= NN_NUMBUF_SZ);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (*print_header) {
|
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("\n%10s %3s %5s %5s %5s %6s %5s %6s %s\n",
|
|
|
|
"Object", "lvl", "iblk", "dblk", "dsize", "dnsize",
|
|
|
|
"lsize", "%full", "type");
|
2008-11-20 23:01:55 +03:00
|
|
|
*print_header = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (object == 0) {
|
2010-08-27 01:24:34 +04:00
|
|
|
dn = DMU_META_DNODE(os);
|
2017-08-14 20:36:48 +03:00
|
|
|
dmu_object_info_from_dnode(dn, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Encrypted datasets will have sensitive bonus buffers
|
|
|
|
* encrypted. Therefore we cannot hold the bonus buffer and
|
|
|
|
* must hold the dnode itself instead.
|
|
|
|
*/
|
|
|
|
error = dmu_object_info(os, object, &doi);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
Native Encryption for ZFS on Linux
2017-08-14 20:36:48 +03:00
|
|
|
fatal("dmu_object_info() failed, errno %u", error);
|
|
|
|
|
|
|
|
if (os->os_encrypted &&
|
|
|
|
DMU_OT_IS_ENCRYPTED(doi.doi_bonus_type)) {
|
|
|
|
error = dnode_hold(os, object, FTAG, &dn);
|
|
|
|
if (error)
|
|
|
|
fatal("dnode_hold() failed, errno %u", error);
|
|
|
|
dnode_held = B_TRUE;
|
|
|
|
} else {
|
|
|
|
error = dmu_bonus_hold(os, object, FTAG, &db);
|
|
|
|
if (error)
|
|
|
|
fatal("dmu_bonus_hold(%llu) failed, errno %u",
|
|
|
|
(u_longlong_t)object, error);
|
|
|
|
bonus = db->db_data;
|
|
|
|
bsize = db->db_size;
|
|
|
|
dn = DB_DNODE((dmu_buf_impl_t *)db);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
if (dnode_slots_used)
|
|
|
|
*dnode_slots_used = doi.doi_dnodesize / DNODE_MIN_SIZE;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(doi.doi_metadata_block_size, iblk, sizeof (iblk));
|
|
|
|
zdb_nicenum(doi.doi_data_block_size, dblk, sizeof (dblk));
|
|
|
|
zdb_nicenum(doi.doi_max_offset, lsize, sizeof (lsize));
|
|
|
|
zdb_nicenum(doi.doi_physical_blocks_512 << 9, asize, sizeof (asize));
|
|
|
|
zdb_nicenum(doi.doi_bonus_size, bonus_size, sizeof (bonus_size));
|
|
|
|
zdb_nicenum(doi.doi_dnodesize, dnsize, sizeof (dnsize));
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) sprintf(fill, "%6.2f", 100.0 * doi.doi_fill_count *
|
|
|
|
doi.doi_data_block_size / (object == 0 ? DNODES_PER_BLOCK : 1) /
|
|
|
|
doi.doi_max_offset);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
aux[0] = '\0';
|
|
|
|
|
|
|
|
if (doi.doi_checksum != ZIO_CHECKSUM_INHERIT || verbosity >= 6) {
|
2017-10-17 01:32:48 +03:00
|
|
|
(void) snprintf(aux + strlen(aux), sizeof (aux) - strlen(aux),
|
|
|
|
" (K=%s)", ZDB_CHECKSUM_NAME(doi.doi_checksum));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (doi.doi_compress != ZIO_COMPRESS_INHERIT || verbosity >= 6) {
|
2017-10-17 01:32:48 +03:00
|
|
|
(void) snprintf(aux + strlen(aux), sizeof (aux) - strlen(aux),
|
|
|
|
" (Z=%s)", ZDB_COMPRESS_NAME(doi.doi_compress));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
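The worst-case arithmetic above can be sketched as a small calculation. This is illustrative only: worst_case_reads() and DNODE_BLOCK_SIZE are hypothetical names, not part of ZFS, and the sketch assumes a 16k dnode block and exactly one spill block per dnode whenever spill blocks are needed.

```c
#include <stdint.h>

#define	DNODE_BLOCK_SIZE	16384u	/* assumed 16k dnode block, per the text */

/*
 * Hypothetical helper: worst-case number of reads needed to fetch
 * ndnodes dnodes of dnode_size bytes each, plus one spill block per
 * dnode when needs_spill is nonzero.
 */
static uint64_t
worst_case_reads(uint64_t ndnodes, uint64_t dnode_size, int needs_spill)
{
	uint64_t bytes = ndnodes * dnode_size;
	/* Round up to whole dnode blocks. */
	uint64_t dnode_blocks =
	    (bytes + DNODE_BLOCK_SIZE - 1) / DNODE_BLOCK_SIZE;

	return (dnode_blocks + (needs_spill ? ndnodes : 0));
}
```

Under these assumptions, 32 dnodes of 512 bytes with spill blocks cost one dnode block read plus 32 spill reads, while 32 dnodes of 1024 bytes without spill blocks cost two dnode block reads.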
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
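The two slot conventions described above can be sketched as follows. The helper names and DNODE_SLOT_SIZE are hypothetical, chosen for illustration; the sketch assumes the size passed in is a nonzero multiple of 512 no larger than 16384.

```c
#include <stdint.h>

#define	DNODE_SLOT_SIZE	512u	/* one dnode_phys_t slot, per the text */

/*
 * On-disk convention (dn_extra_slots): slots consumed beyond the
 * first, so a 512 byte dnode stores 0.
 */
static uint8_t
extra_slots_from_size(uint32_t dnode_size)
{
	return ((uint8_t)(dnode_size / DNODE_SLOT_SIZE - 1));
}

/*
 * In-memory convention (dn_num_slots): total slots consumed, which is
 * always one more than the on-disk value.
 */
static uint32_t
num_slots_from_extra(uint8_t extra_slots)
{
	return ((uint32_t)extra_slots + 1);
}
```

Storing the count of extra slots on disk means the zero-filled padding in existing 512 byte dnodes already encodes the correct value.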
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
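The last point above, stepping through a dnode block by slot count rather than by plain array index, can be sketched with a simplified model. struct slot and count_dnodes() are hypothetical stand-ins; the real dnode_phys_t and the real traversal code carry far more state.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for dnode_phys_t; only what the sketch needs. */
struct slot {
	uint8_t allocated;	/* nonzero if a dnode starts in this slot */
	uint8_t extra_slots;	/* additional slots that dnode consumes */
};

/*
 * Count the dnodes that start in a block of nslots slots. Interior
 * slots of a large dnode are stepped over, so whatever bytes they
 * hold are never interpreted as a dnode.
 */
static size_t
count_dnodes(const struct slot *blk, size_t nslots)
{
	size_t count = 0;
	size_t i = 0;

	while (i < nslots) {
		if (blk[i].allocated) {
			count++;
			i += (size_t)blk[i].extra_slots + 1;
		} else {
			i++;	/* hole: advance a single slot */
		}
	}
	return (count);
}
```

A naive loop that visited every index would miscount whenever the interior of a large dnode happened to look like an allocated dnode.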
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
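The packing described above can be sketched as follows. This follows the wording of the description only (slot count in the uppermost 8 bits, object id in the low 48 bits); the names are hypothetical and the actual ZFS macros and bit layout may differ.

```c
#include <stdint.h>

#define	FOID_SLOT_SHIFT	56	/* uppermost 8 bits, per the text */
#define	FOID_OBJ_MASK	((UINT64_C(1) << 48) - 1)	/* 48-bit object id */

/* Combine an object id and a slot count into one 64-bit field. */
static uint64_t
foid_pack(uint64_t object, uint8_t slots)
{
	return ((object & FOID_OBJ_MASK) |
	    ((uint64_t)slots << FOID_SLOT_SHIFT));
}

/* Recover the object id, ignoring the slot bits. */
static uint64_t
foid_object(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Recover the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOT_SHIFT));
}
```

In this sketch a record written without a slot count decodes as 0 slots, because the upper bits of a plain 48-bit object id are zero.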
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("%10lld %3u %5s %5s %5s %6s %5s %6s %s%s\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
(u_longlong_t)object, doi.doi_indirection, iblk, dblk,
|
2016-08-01 20:42:04 +03:00
|
|
|
asize, dnsize, lsize, fill, zdb_ot_name(doi.doi_type), aux);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (doi.doi_bonus_type != DMU_OT_NONE && verbosity > 3) {
|
Implement large_dnode pool feature
2016-03-17 04:25:34 +03:00
|
|
|
(void) printf("%10s %3s %5s %5s %5s %5s %5s %6s %s\n",
|
|
|
|
"", "", "", "", "", "", bonus_size, "bonus",
|
2016-08-01 20:42:04 +03:00
|
|
|
zdb_ot_name(doi.doi_bonus_type));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (verbosity >= 4) {
|
2016-10-04 21:46:10 +03:00
|
|
|
(void) printf("\tdnode flags: %s%s%s%s\n",
|
2009-07-03 02:44:48 +04:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) ?
|
|
|
|
"USED_BYTES " : "",
|
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USERUSED_ACCOUNTED) ?
|
2010-05-29 00:45:14 +04:00
|
|
|
"USERUSED_ACCOUNTED " : "",
|
2016-10-04 21:46:10 +03:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_USEROBJUSED_ACCOUNTED) ?
|
|
|
|
"USEROBJUSED_ACCOUNTED " : "",
|
2010-05-29 00:45:14 +04:00
|
|
|
(dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR) ?
|
|
|
|
"SPILL_BLKPTR" : "");
|
2009-07-03 02:44:48 +04:00
|
|
|
(void) printf("\tdnode maxblkid: %llu\n",
|
|
|
|
(u_longlong_t)dn->dn_phys->dn_maxblkid);
|
|
|
|
|
Native Encryption for ZFS on Linux
2017-08-14 20:36:48 +03:00
|
|
|
if (!dnode_held) {
|
|
|
|
object_viewer[ZDB_OT_TYPE(doi.doi_bonus_type)](os,
|
|
|
|
object, bonus, bsize);
|
|
|
|
} else {
|
|
|
|
(void) printf("\t\t(bonus encrypted)\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!os->os_encrypted || !DMU_OT_IS_ENCRYPTED(doi.doi_type)) {
|
|
|
|
object_viewer[ZDB_OT_TYPE(doi.doi_type)](os, object,
|
|
|
|
NULL, 0);
|
|
|
|
} else {
|
|
|
|
(void) printf("\t\t(object encrypted)\n");
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
*print_header = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (verbosity >= 5)
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_indirect(dn);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (verbosity >= 5) {
|
|
|
|
/*
|
|
|
|
* Report the list of segments that comprise the object.
|
|
|
|
*/
|
|
|
|
uint64_t start = 0;
|
|
|
|
uint64_t end;
|
|
|
|
uint64_t blkfill = 1;
|
|
|
|
int minlvl = 1;
|
|
|
|
|
|
|
|
if (dn->dn_type == DMU_OT_DNODE) {
|
|
|
|
minlvl = 0;
|
|
|
|
blkfill = DNODES_PER_BLOCK;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (;;) {
|
2010-05-29 00:45:14 +04:00
|
|
|
char segsize[32];
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (segsize) >= NN_NUMBUF_SZ);
|
2008-12-03 23:09:06 +03:00
|
|
|
error = dnode_next_offset(dn,
|
|
|
|
0, &start, minlvl, blkfill, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
end = start;
|
2008-12-03 23:09:06 +03:00
|
|
|
error = dnode_next_offset(dn,
|
|
|
|
DNODE_FIND_HOLE, &end, minlvl, blkfill, 0);
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(end - start, segsize, sizeof (segsize));
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\t\tsegment [%016llx, %016llx)"
|
|
|
|
" size %5s\n", (u_longlong_t)start,
|
|
|
|
(u_longlong_t)end, segsize);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
start = end;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (db != NULL)
|
|
|
|
dmu_buf_rele(db, FTAG);
|
Native Encryption for ZFS on Linux
2017-08-14 20:36:48 +03:00
|
|
|
if (dnode_held)
|
|
|
|
dnode_rele(dn, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char *objset_types[DMU_OST_NUMTYPES] = {
|
2008-11-20 23:01:55 +03:00
|
|
|
"NONE", "META", "ZPL", "ZVOL", "OTHER", "ANY" };
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_dir(objset_t *os)
|
|
|
|
{
|
|
|
|
dmu_objset_stats_t dds;
|
|
|
|
uint64_t object, object_count;
|
|
|
|
uint64_t refdbytes, usedobjs, scratch;
|
2010-05-29 00:45:14 +04:00
|
|
|
char numbuf[32];
|
2009-07-03 02:44:48 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN + 20];
|
2016-06-16 00:28:36 +03:00
|
|
|
char osname[ZFS_MAX_DATASET_NAME_LEN];
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *type = "UNKNOWN";
|
2008-11-20 23:01:55 +03:00
|
|
|
int verbosity = dump_opt['d'];
|
|
|
|
int print_header = 1;
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned i;
|
|
|
|
int error;
|
2017-09-06 02:15:04 +03:00
|
|
|
uint64_t total_slots_used = 0;
|
|
|
|
uint64_t max_slot_used = 0;
|
|
|
|
uint64_t dnode_slots;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (numbuf) >= NN_NUMBUF_SZ);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_objset_fast_stat(os, &dds);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (dds.dds_type < DMU_OST_NUMTYPES)
|
|
|
|
type = objset_types[dds.dds_type];
|
|
|
|
|
|
|
|
if (dds.dds_type == DMU_OST_META) {
|
|
|
|
dds.dds_creation_txg = TXG_INITIAL;
|
2014-06-06 01:19:08 +04:00
|
|
|
usedobjs = BP_GET_FILL(os->os_rootbp);
|
2015-04-01 18:14:34 +03:00
|
|
|
refdbytes = dsl_dir_phys(os->os_spa->spa_dsl_pool->dp_mos_dir)->
|
|
|
|
dd_used_bytes;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
dmu_objset_space(os, &refdbytes, &scratch, &usedobjs, &scratch);
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(usedobjs, ==, BP_GET_FILL(os->os_rootbp));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
zdb_nicenum(refdbytes, numbuf, sizeof (numbuf));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (verbosity >= 4) {
|
2013-12-09 22:37:51 +04:00
|
|
|
(void) snprintf(blkbuf, sizeof (blkbuf), ", rootbp ");
|
|
|
|
(void) snprintf_blkptr(blkbuf + strlen(blkbuf),
|
|
|
|
sizeof (blkbuf) - strlen(blkbuf), os->os_rootbp);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
blkbuf[0] = '\0';
|
|
|
|
}
|
|
|
|
|
|
|
|
dmu_objset_name(os, osname);
|
|
|
|
|
|
|
|
(void) printf("Dataset %s [%s], ID %llu, cr_txg %llu, "
|
|
|
|
"%s, %llu objects%s\n",
|
|
|
|
osname, type, (u_longlong_t)dmu_objset_id(os),
|
|
|
|
(u_longlong_t)dds.dds_creation_txg,
|
|
|
|
numbuf, (u_longlong_t)usedobjs, blkbuf);
|
|
|
|
|
|
|
|
if (zopt_objects != 0) {
|
|
|
|
for (i = 0; i < zopt_objects; i++)
|
|
|
|
dump_object(os, zopt_object[i], verbosity,
|
2017-09-06 02:15:04 +03:00
|
|
|
&print_header, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['i'] != 0 || verbosity >= 2)
|
|
|
|
dump_intent_log(dmu_objset_zil(os));
|
|
|
|
|
2016-09-22 19:30:13 +03:00
|
|
|
if (dmu_objset_ds(os) != NULL) {
|
|
|
|
dsl_dataset_t *ds = dmu_objset_ds(os);
|
|
|
|
dump_deadlist(&ds->ds_deadlist);
|
|
|
|
|
|
|
|
if (dsl_dataset_remap_deadlist_exists(ds)) {
|
|
|
|
(void) printf("ds_remap_deadlist:\n");
|
|
|
|
dump_deadlist(&ds->ds_remap_deadlist);
|
|
|
|
}
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (verbosity < 2)
|
|
|
|
return;
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (BP_IS_HOLE(os->os_rootbp))
|
2010-05-29 00:45:14 +04:00
|
|
|
return;
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, 0, verbosity, &print_header, NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
object_count = 0;
|
2010-08-27 01:24:34 +04:00
|
|
|
if (DMU_USERUSED_DNODE(os) != NULL &&
|
|
|
|
DMU_USERUSED_DNODE(os)->dn_type != 0) {
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, DMU_USERUSED_OBJECT, verbosity, &print_header,
|
|
|
|
NULL);
|
|
|
|
dump_object(os, DMU_GROUPUSED_OBJECT, verbosity, &print_header,
|
|
|
|
NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (DMU_PROJECTUSED_DNODE(os) != NULL &&
|
|
|
|
DMU_PROJECTUSED_DNODE(os)->dn_type != 0)
|
|
|
|
dump_object(os, DMU_PROJECTUSED_OBJECT, verbosity,
|
|
|
|
&print_header, NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
object = 0;
|
|
|
|
while ((error = dmu_object_next(os, &object, B_FALSE, 0)) == 0) {
|
2017-09-06 02:15:04 +03:00
|
|
|
dump_object(os, object, verbosity, &print_header, &dnode_slots);
|
2008-11-20 23:01:55 +03:00
|
|
|
object_count++;
|
2017-09-06 02:15:04 +03:00
|
|
|
total_slots_used += dnode_slots;
|
|
|
|
max_slot_used = object + dnode_slots - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
(void) printf("\n");
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
(void) printf(" Dnode slots:\n");
|
|
|
|
(void) printf("\tTotal used: %10llu\n",
|
|
|
|
(u_longlong_t)total_slots_used);
|
|
|
|
(void) printf("\tMax used: %10llu\n",
|
|
|
|
(u_longlong_t)max_slot_used);
|
|
|
|
(void) printf("\tPercent empty: %10lf\n",
|
|
|
|
(double)(max_slot_used - total_slots_used)*100 /
|
|
|
|
(double)max_slot_used);
|
|
|
|
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this has happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there are few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and try to figure out the path of the object by following its parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, its parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken its place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change its parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
ASSERT3U(object_count, ==, usedobjs);
|
|
|
|
|
2017-09-06 02:15:04 +03:00
|
|
|
(void) printf("\n");
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
if (error != ESRCH) {
|
|
|
|
(void) fprintf(stderr, "dmu_object_next() = %d\n", error);
|
|
|
|
abort();
|
|
|
|
}
|
2017-07-06 20:35:20 +03:00
|
|
|
if (leaked_objects != 0) {
|
|
|
|
(void) printf("%d potentially leaked objects detected\n",
|
|
|
|
leaked_objects);
|
|
|
|
leaked_objects = 0;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
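The dnode slot summary printed at the end of dump_dir() can be tried out in isolation. The sketch below is an illustration only, not code from zdb: `slot_stats_t`, `slot_stats_add()` and `slot_stats_percent_empty()` are hypothetical names. It assumes objects are visited in increasing order, as the dmu_object_next() loop does, and the guard against a zero `max_slot_used` is an added assumption rather than something dump_dir() performs.

```c
#include <stdint.h>

/* Hypothetical accumulator mirroring the counters kept by dump_dir(). */
typedef struct slot_stats {
	uint64_t total_slots_used;	/* sum of slots over all objects seen */
	uint64_t max_slot_used;		/* last slot of the latest object seen */
} slot_stats_t;

/* Record one object that starts at `object` and spans `dnode_slots` slots. */
static void
slot_stats_add(slot_stats_t *st, uint64_t object, uint64_t dnode_slots)
{
	st->total_slots_used += dnode_slots;
	st->max_slot_used = object + dnode_slots - 1;
}

/*
 * Percentage of slots up to max_slot_used that hold no dnode.
 * Returns 0.0 when nothing was recorded (assumed guard, not in zdb).
 */
static double
slot_stats_percent_empty(const slot_stats_t *st)
{
	if (st->max_slot_used == 0)
		return (0.0);
	return ((double)(st->max_slot_used - st->total_slots_used) * 100 /
	    (double)st->max_slot_used);
}
```

For example, recording object 1 with one slot and object 4 with two slots gives three used slots out of five, so two of the five slots are counted as empty.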
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_uberblock(uberblock_t *ub, const char *header, const char *footer)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
time_t timestamp = ub->ub_timestamp;
|
|
|
|
|
2010-08-26 20:52:40 +04:00
|
|
|
(void) printf("%s", header ? header : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\tmagic = %016llx\n", (u_longlong_t)ub->ub_magic);
|
|
|
|
(void) printf("\tversion = %llu\n", (u_longlong_t)ub->ub_version);
|
|
|
|
(void) printf("\ttxg = %llu\n", (u_longlong_t)ub->ub_txg);
|
|
|
|
(void) printf("\tguid_sum = %llu\n", (u_longlong_t)ub->ub_guid_sum);
|
|
|
|
(void) printf("\ttimestamp = %llu UTC = %s",
|
|
|
|
(u_longlong_t)ub->ub_timestamp, asctime(localtime(&timestamp)));
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
(void) printf("\tmmp_magic = %016llx\n",
|
|
|
|
(u_longlong_t)ub->ub_mmp_magic);
|
|
|
|
if (ub->ub_mmp_magic == MMP_MAGIC)
|
|
|
|
(void) printf("\tmmp_delay = %0llu\n",
|
|
|
|
(u_longlong_t)ub->ub_mmp_delay);
|
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (i.e. the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
|
|
|
if (dump_opt['u'] >= 4) {
|
2008-11-20 23:01:55 +03:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), &ub->ub_rootbp);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\trootbp = %s\n", blkbuf);
|
|
|
|
}
|
2010-08-26 20:52:40 +04:00
|
|
|
(void) printf("%s", footer ? footer : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
}
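dump_uberblock() prints the raw timestamp, which is seconds since the epoch, next to a string produced by asctime(localtime()), so the readable part depends on the timezone of the machine running zdb. A minimal sketch of a timezone-independent variant follows. `format_utc_timestamp()` is a hypothetical helper, not part of zdb, and it swaps in gmtime_r() and strftime() for the asctime(localtime()) pair used above. Unlike asctime(), the `%d` conversion pads single-digit days with a zero rather than a space.

```c
#define	_POSIX_C_SOURCE 200809L	/* for gmtime_r() under strict -std modes */

#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Format `secs` (seconds since the epoch) in UTC, in a layout similar
 * to asctime() but without the trailing newline.  Returns the number of
 * bytes written, or 0 if the conversion failed or `buf` is too small.
 */
static size_t
format_utc_timestamp(uint64_t secs, char *buf, size_t buflen)
{
	time_t t = (time_t)secs;
	struct tm tm;

	if (gmtime_r(&t, &tm) == NULL)
		return (0);
	return (strftime(buf, buflen, "%a %b %d %H:%M:%S %Y", &tm));
}
```

A caller would pass `ub->ub_timestamp` and a small stack buffer, then print the result with its own newline.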
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_config(spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_t *db;
|
|
|
|
size_t nvsize = 0;
|
|
|
|
int error = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = dmu_bonus_hold(spa->spa_meta_objset,
|
|
|
|
spa->spa_config_object, FTAG, &db);
|
|
|
|
|
|
|
|
if (error == 0) {
|
|
|
|
nvsize = *(uint64_t *)db->db_data;
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
|
|
|
|
(void) printf("\nMOS Configuration:\n");
|
|
|
|
dump_packed_nvlist(spa->spa_meta_objset,
|
|
|
|
spa->spa_config_object, (void *)&nvsize, 1);
|
|
|
|
} else {
|
|
|
|
(void) fprintf(stderr, "dmu_bonus_hold(%llu) failed, errno %d\n",
|
|
|
|
(u_longlong_t)spa->spa_config_object, error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_cachefile(const char *cachefile)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
struct stat64 statbuf;
|
|
|
|
char *buf;
|
|
|
|
nvlist_t *config;
|
|
|
|
|
|
|
|
if ((fd = open64(cachefile, O_RDONLY)) < 0) {
|
|
|
|
(void) printf("cannot open '%s': %s\n", cachefile,
|
|
|
|
strerror(errno));
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fstat64(fd, &statbuf) != 0) {
|
|
|
|
(void) printf("failed to stat '%s': %s\n", cachefile,
|
|
|
|
strerror(errno));
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((buf = malloc(statbuf.st_size)) == NULL) {
|
|
|
|
(void) fprintf(stderr, "failed to allocate %llu bytes\n",
|
|
|
|
(u_longlong_t)statbuf.st_size);
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (read(fd, buf, statbuf.st_size) != statbuf.st_size) {
|
|
|
|
(void) fprintf(stderr, "failed to read %llu bytes\n",
|
|
|
|
(u_longlong_t)statbuf.st_size);
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
(void) close(fd);
|
|
|
|
|
|
|
|
if (nvlist_unpack(buf, statbuf.st_size, &config, 0) != 0) {
|
|
|
|
(void) fprintf(stderr, "failed to unpack nvlist\n");
|
|
|
|
exit(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
free(buf);
|
|
|
|
|
|
|
|
dump_nvlist(config, 0);
|
|
|
|
|
|
|
|
nvlist_free(config);
|
|
|
|
}
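dump_cachefile() issues a single read() and treats any count other than st_size as a failure. read() is allowed to return fewer bytes than requested, so a retry loop is a common alternative. The sketch below shows one way to write it. `read_full()` is a hypothetical helper and is not part of zdb.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly `len` bytes from `fd` into `buf`, retrying on short
 * reads and EINTR.  Returns 0 on success, -1 on error or early EOF.
 */
static int
read_full(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return (-1);
		}
		if (n == 0)
			return (-1);	/* file shorter than expected */
		p += n;
		len -= (size_t)n;
	}
	return (0);
}
```

With such a helper the check in dump_cachefile() would become `read_full(fd, buf, statbuf.st_size) != 0`, and the error message could stay as it is.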
|
|
|
|
|
2017-02-03 01:03:48 +03:00
|
|
|
/*
|
|
|
|
* ZFS label nvlist stats
|
|
|
|
*/
|
|
|
|
typedef struct zdb_nvl_stats {
|
|
|
|
int zns_list_count;
|
|
|
|
int zns_leaf_count;
|
|
|
|
size_t zns_leaf_largest;
|
|
|
|
size_t zns_leaf_total;
|
|
|
|
nvlist_t *zns_string;
|
|
|
|
nvlist_t *zns_uint64;
|
|
|
|
nvlist_t *zns_boolean;
|
|
|
|
} zdb_nvl_stats_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
collect_nvlist_stats(nvlist_t *nvl, zdb_nvl_stats_t *stats)
|
|
|
|
{
|
|
|
|
nvlist_t *list, **array;
|
|
|
|
nvpair_t *nvp = NULL;
|
|
|
|
char *name;
|
|
|
|
uint_t i, items;
|
|
|
|
|
|
|
|
stats->zns_list_count++;
|
|
|
|
|
|
|
|
while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) {
|
|
|
|
name = nvpair_name(nvp);
|
|
|
|
|
|
|
|
switch (nvpair_type(nvp)) {
|
|
|
|
case DATA_TYPE_STRING:
|
|
|
|
fnvlist_add_string(stats->zns_string, name,
|
|
|
|
fnvpair_value_string(nvp));
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_UINT64:
|
|
|
|
fnvlist_add_uint64(stats->zns_uint64, name,
|
|
|
|
fnvpair_value_uint64(nvp));
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_BOOLEAN:
|
|
|
|
fnvlist_add_boolean(stats->zns_boolean, name);
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_NVLIST:
|
|
|
|
if (nvpair_value_nvlist(nvp, &list) == 0)
|
|
|
|
collect_nvlist_stats(list, stats);
|
|
|
|
break;
|
|
|
|
case DATA_TYPE_NVLIST_ARRAY:
|
|
|
|
if (nvpair_value_nvlist_array(nvp, &array, &items) != 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
for (i = 0; i < items; i++) {
|
|
|
|
collect_nvlist_stats(array[i], stats);
|
|
|
|
|
|
|
|
/* collect stats on leaf vdev */
|
|
|
|
if (strcmp(name, "children") == 0) {
|
|
|
|
size_t size;
|
|
|
|
|
|
|
|
(void) nvlist_size(array[i], &size,
|
|
|
|
NV_ENCODE_XDR);
|
|
|
|
stats->zns_leaf_total += size;
|
|
|
|
if (size > stats->zns_leaf_largest)
|
|
|
|
stats->zns_leaf_largest = size;
|
|
|
|
stats->zns_leaf_count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
(void) printf("skip type %d!\n", (int)nvpair_type(nvp));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dump_nvlist_stats(nvlist_t *nvl, size_t cap)
|
|
|
|
{
|
|
|
|
zdb_nvl_stats_t stats = { 0 };
|
|
|
|
size_t size, sum = 0, total;
|
2017-02-07 20:29:47 +03:00
|
|
|
size_t noise;
|
2017-02-03 01:03:48 +03:00
|
|
|
|
|
|
|
/* requires nvlist with non-unique names for stat collection */
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_string, 0, 0));
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_uint64, 0, 0));
|
|
|
|
VERIFY0(nvlist_alloc(&stats.zns_boolean, 0, 0));
|
|
|
|
VERIFY0(nvlist_size(stats.zns_boolean, &noise, NV_ENCODE_XDR));
|
|
|
|
|
|
|
|
(void) printf("\n\nZFS Label NVList Config Stats:\n");
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(nvl, &total, NV_ENCODE_XDR));
|
|
|
|
(void) printf(" %d bytes used, %d bytes free (using %4.1f%%)\n\n",
|
|
|
|
(int)total, (int)(cap - total), 100.0 * total / cap);
|
|
|
|
|
|
|
|
collect_nvlist_stats(nvl, &stats);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_uint64, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "integers:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_uint64),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_string, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "strings:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_string),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
VERIFY0(nvlist_size(stats.zns_boolean, &size, NV_ENCODE_XDR));
|
|
|
|
size -= noise;
|
|
|
|
sum += size;
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n", "booleans:",
|
|
|
|
(int)fnvlist_num_pairs(stats.zns_boolean),
|
|
|
|
(int)size, 100.0 * size / total);
|
|
|
|
|
|
|
|
size = total - sum; /* treat remainder as nvlist overhead */
|
|
|
|
(void) printf("%12s %4d %6d bytes (%5.2f%%)\n\n", "nvlists:",
|
|
|
|
stats.zns_list_count, (int)size, 100.0 * size / total);
|
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
if (stats.zns_leaf_count > 0) {
|
|
|
|
size_t average = stats.zns_leaf_total / stats.zns_leaf_count;
|
2017-02-03 01:03:48 +03:00
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
(void) printf("%12s %4d %6d bytes average\n", "leaf vdevs:",
|
|
|
|
stats.zns_leaf_count, (int)average);
|
|
|
|
(void) printf("%24d bytes largest\n",
|
|
|
|
(int)stats.zns_leaf_largest);
|
2017-02-03 01:03:48 +03:00
|
|
|
|
2017-02-07 20:29:47 +03:00
|
|
|
if (dump_opt['l'] >= 3 && average > 0)
|
|
|
|
(void) printf(" space for %d additional leaf vdevs\n",
|
|
|
|
(int)((cap - total) / average));
|
|
|
|
}
|
2017-02-03 01:03:48 +03:00
|
|
|
(void) printf("\n");
|
|
|
|
|
|
|
|
nvlist_free(stats.zns_string);
|
|
|
|
nvlist_free(stats.zns_uint64);
|
|
|
|
nvlist_free(stats.zns_boolean);
|
|
|
|
}
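The accounting in dump_nvlist_stats() is plain arithmetic once the packed sizes are known, so it can be sketched without libnvpair. The code below is an illustration under stated assumptions. `label_breakdown_t`, `label_breakdown()` and `additional_leaf_vdevs()` are hypothetical names, the sizes are passed in rather than measured with nvlist_size(), and the `total > cap` guard is an addition that dump_nvlist_stats() does not make.

```c
#include <stddef.h>

/* Hypothetical result of splitting a packed label config by value type. */
typedef struct label_breakdown {
	size_t integers;	/* bytes attributed to uint64 pairs */
	size_t strings;		/* bytes attributed to string pairs */
	size_t booleans;	/* bytes attributed to boolean pairs */
	size_t overhead;	/* remainder, treated as nvlist overhead */
} label_breakdown_t;

/*
 * `total` is the packed size of the whole config.  The three *_packed
 * arguments are the packed sizes of the per-type scratch lists, and
 * `noise` is the packed size of an empty list, which is subtracted
 * from each scratch list before it is counted.
 */
static label_breakdown_t
label_breakdown(size_t total, size_t uint64_packed, size_t string_packed,
    size_t boolean_packed, size_t noise)
{
	label_breakdown_t b;

	b.integers = uint64_packed - noise;
	b.strings = string_packed - noise;
	b.booleans = boolean_packed - noise;
	b.overhead = total - (b.integers + b.strings + b.booleans);
	return (b);
}

/* Estimate how many more average-sized leaf vdevs fit in `cap` bytes. */
static size_t
additional_leaf_vdevs(size_t cap, size_t total, size_t leaf_total,
    size_t leaf_count)
{
	size_t average;

	if (leaf_count == 0 || total > cap)
		return (0);
	average = leaf_total / leaf_count;
	return (average == 0 ? 0 : (cap - total) / average);
}
```

Treating the remainder as overhead means the four categories always add up to the packed size of the config, at the cost of lumping every other pair type into the "nvlists" line.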
|
|
|
|
|
2017-03-07 03:01:45 +03:00
typedef struct cksum_record {
	zio_cksum_t cksum;
	boolean_t labels[VDEV_LABELS];
	avl_node_t link;
} cksum_record_t;

static int
Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
2017-03-07 03:01:45 +03:00
cksum_record_compare(const void *x1, const void *x2)
{
	const cksum_record_t *l = (cksum_record_t *)x1;
	const cksum_record_t *r = (cksum_record_t *)x2;
	int arraysize = ARRAY_SIZE(l->cksum.zc_word);
	int difference;

	for (int i = 0; i < arraysize; i++) {
		difference = AVL_CMP(l->cksum.zc_word[i], r->cksum.zc_word[i]);
		if (difference)
			break;
	}

	return (difference);
}
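
The comparator above orders records lexicographically by the words of the checksum and stops at the first word that differs. A minimal standalone sketch of that ordering follows. `cksum4_t`, `three_way` and `cksum4_compare` are hypothetical stand-ins for `zio_cksum_t`, `AVL_CMP` and `cksum_record_compare`; the four-word checksum and the -1/0/1 result are assumptions about those definitions, not something this file states.

```c
#include <assert.h>
#include <stdint.h>

#define	CKSUM_WORDS	4	/* assumed word count of the checksum */

/* Hypothetical stand-in for zio_cksum_t: four 64-bit words. */
typedef struct cksum4 {
	uint64_t word[CKSUM_WORDS];
} cksum4_t;

/* Three-way compare yielding -1, 0 or 1. */
static int
three_way(uint64_t a, uint64_t b)
{
	return ((a > b) - (a < b));
}

/* Lexicographic order over the words; the first differing word decides. */
static int
cksum4_compare(const void *x1, const void *x2)
{
	const cksum4_t *l = x1;
	const cksum4_t *r = x2;
	int difference = 0;

	for (int i = 0; i < CKSUM_WORDS; i++) {
		difference = three_way(l->word[i], r->word[i]);
		if (difference != 0)
			break;
	}
	assert(difference >= -1 && difference <= 1);
	return (difference);
}
```

Because the function has the usual `const void *` comparator signature, the same sketch could also be handed to `qsort()` to sort an array of checksums.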

static cksum_record_t *
cksum_record_alloc(zio_cksum_t *cksum, int l)
{
	cksum_record_t *rec;

	rec = umem_zalloc(sizeof (*rec), UMEM_NOFAIL);
	rec->cksum = *cksum;
	rec->labels[l] = B_TRUE;

	return (rec);
}

static cksum_record_t *
cksum_record_lookup(avl_tree_t *tree, zio_cksum_t *cksum)
{
	cksum_record_t lookup = { .cksum = *cksum };
	avl_index_t where;

	return (avl_find(tree, &lookup, &where));
}

static cksum_record_t *
cksum_record_insert(avl_tree_t *tree, zio_cksum_t *cksum, int l)
{
	cksum_record_t *rec;

	rec = cksum_record_lookup(tree, cksum);
	if (rec) {
		rec->labels[l] = B_TRUE;
	} else {
		rec = cksum_record_alloc(cksum, l);
		avl_add(tree, rec);
	}

	return (rec);
}
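
Together, the lookup and insert helpers implement the deduplication described in the commit message: a checksum seen in several labels gets a single record, and its `labels[]` flags note every label it came from. The sketch below shows the same idea under simplifying assumptions: a small fixed array replaces the AVL tree, and a single 64-bit value replaces `zio_cksum_t`. All names (`rec_table_t`, `rec_lookup`, `rec_insert`, `rec_first_label`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	N_LABELS	4	/* assumed to match VDEV_LABELS */
#define	MAX_RECS	16	/* arbitrary capacity for the sketch */

typedef struct rec {
	uint64_t cksum;			/* simplified: one word, not four */
	bool labels[N_LABELS];		/* which labels carry this checksum */
} rec_t;

typedef struct rec_table {
	rec_t recs[MAX_RECS];
	size_t count;
} rec_table_t;

/* Linear search standing in for avl_find(); NULL when not present. */
static rec_t *
rec_lookup(rec_table_t *t, uint64_t cksum)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->recs[i].cksum == cksum)
			return (&t->recs[i]);
	return (NULL);
}

/* Mark label l on the existing record, or create a new record for it. */
static rec_t *
rec_insert(rec_table_t *t, uint64_t cksum, int l)
{
	rec_t *rec = rec_lookup(t, cksum);

	assert(l >= 0 && l < N_LABELS);
	if (rec == NULL) {
		assert(t->count < MAX_RECS);
		rec = &t->recs[t->count++];
		rec->cksum = cksum;
		for (int i = 0; i < N_LABELS; i++)
			rec->labels[i] = false;
	}
	rec->labels[l] = true;
	return (rec);
}

/* Lowest label holding the record, or -1; only that label prints it. */
static int
rec_first_label(const rec_t *rec)
{
	for (int i = 0; i < N_LABELS; i++)
		if (rec->labels[i])
			return (i);
	return (-1);
}
```

Inserting the same checksum for labels 1, 2 and 3 leaves one record whose first label is 1, which corresponds to the `labels = 1 2 3` line in the example output.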

static int
first_label(cksum_record_t *rec)
{
	for (int i = 0; i < VDEV_LABELS; i++)
		if (rec->labels[i])
			return (i);

	return (-1);
}

static void
print_label_numbers(char *prefix, cksum_record_t *rec)
{
	printf("%s", prefix);
	for (int i = 0; i < VDEV_LABELS; i++)
		if (rec->labels[i] == B_TRUE)
			printf("%d ", i);
	printf("\n");
}

#define MAX_UBERBLOCK_COUNT (VDEV_UBERBLOCK_RING >> UBERBLOCK_SHIFT)

typedef struct label {
	vdev_label_t label;
	nvlist_t *config_nv;
	cksum_record_t *config;
	cksum_record_t *uberblocks[MAX_UBERBLOCK_COUNT];
	boolean_t header_printed;
	boolean_t read_failed;
} label_t;

static void
print_label_header(label_t *label, int l)
{
	if (dump_opt['q'])
		return;

	if (label->header_printed == B_TRUE)
		return;

	(void) printf("------------------------------------\n");
	(void) printf("LABEL %d\n", l);
	(void) printf("------------------------------------\n");

	label->header_printed = B_TRUE;
}

static void
dump_config_from_label(label_t *label, size_t buflen, int l)
{
	if (dump_opt['q'])
		return;

	if ((dump_opt['l'] < 3) && (first_label(label->config) != l))
		return;

	print_label_header(label, l);
	dump_nvlist(label->config_nv, 4);
	print_label_numbers(" labels = ", label->config);

	if (dump_opt['l'] >= 2)
		dump_nvlist_stats(label->config_nv, buflen);
}
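
The `-l` verbosity rule in `dump_config_from_label` reduces to two predicates: below `-lll` a configuration is printed only by the first label that holds it, and from `-ll` upward the space usage stats follow a printed configuration. A hedged sketch of that decision is shown here. The names `should_dump_config` and `should_dump_stats` are hypothetical, and the `-q` flag and the `-l` count are passed in as plain arguments instead of being read from `dump_opt`.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether label l prints a configuration.
 * l_count: number of -l flags given; first: lowest label holding the config.
 */
static bool
should_dump_config(bool quiet, int l_count, int first, int l)
{
	assert(l_count >= 1);
	if (quiet)
		return (false);
	if (l_count < 3 && first != l)
		return (false);	/* duplicate of an earlier label */
	return (true);
}

/* Space usage stats accompany a dumped configuration from -ll upward. */
static bool
should_dump_stats(bool quiet, int l_count, int first, int l)
{
	return (should_dump_config(quiet, l_count, first, l) && l_count >= 2);
}
```

With a configuration first seen in label 0, `should_dump_config(false, 1, 0, 2)` is false (label 2 stays silent), while the same call with `l_count` of 3 is true.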

#define ZDB_MAX_UB_HEADER_SIZE 32

static void
dump_label_uberblocks(label_t *label, uint64_t ashift, int label_num)
{

	vdev_t vd;
	char header[ZDB_MAX_UB_HEADER_SIZE];

	vd.vdev_ashift = ashift;
	vd.vdev_top = &vd;

	for (int i = 0; i < VDEV_UBERBLOCK_COUNT(&vd); i++) {
		uint64_t uoff = VDEV_UBERBLOCK_OFFSET(&vd, i);
		uberblock_t *ub = (void *)((char *)&label->label + uoff);
		cksum_record_t *rec = label->uberblocks[i];

		if (rec == NULL) {
			if (dump_opt['u'] >= 2) {
				print_label_header(label, label_num);
				(void) printf(" Uberblock[%d] invalid\n", i);
			}
			continue;
		}

		if ((dump_opt['u'] < 3) && (first_label(rec) != label_num))
			continue;
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
		if ((dump_opt['u'] < 4) &&
		    (ub->ub_mmp_magic == MMP_MAGIC) && ub->ub_mmp_delay &&
		    (i >= VDEV_UBERBLOCK_COUNT(&vd) - MMP_BLOCKS_PER_LABEL))
			continue;
		print_label_header(label, label_num);
		(void) snprintf(header, ZDB_MAX_UB_HEADER_SIZE,
		    " Uberblock[%d]\n", i);
		dump_uberblock(ub, header, "");
		print_label_numbers(" labels = ", rec);
	}
}
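
How many uberblock slots the loop visits depends on the vdev's ashift, and the slot-position part of the MMP test only matches the trailing slots of the ring. The sketch below works through that arithmetic. The constants (a 128 KiB ring, a minimum slot shift of 10, a cap of 13, and one MMP slot per label) are assumptions modeled on the ZFS headers, not values stated in this file, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; the real ones live in the ZFS headers. */
#define	RING_BYTES		(128 << 10)	/* size of the uberblock ring */
#define	MIN_UB_SHIFT		10		/* 1 KiB minimum slot */
#define	MAX_UB_SHIFT		13		/* 8 KiB maximum slot */
#define	MMP_SLOTS_PER_LABEL	1

/* Slot size is 2^ashift, clamped to [MIN_UB_SHIFT, MAX_UB_SHIFT]. */
static uint64_t
ub_shift(uint64_t ashift)
{
	if (ashift < MIN_UB_SHIFT)
		return (MIN_UB_SHIFT);
	if (ashift > MAX_UB_SHIFT)
		return (MAX_UB_SHIFT);
	return (ashift);
}

/* Number of uberblock slots in one label's ring. */
static int
ub_count(uint64_t ashift)
{
	return ((int)(RING_BYTES >> ub_shift(ashift)));
}

/* Byte offset of slot i from the start of the ring (not of the label). */
static uint64_t
ub_ring_offset(uint64_t ashift, int i)
{
	assert(i >= 0 && i < ub_count(ashift));
	return ((uint64_t)i << ub_shift(ashift));
}

/* True when slot i is one of the trailing slots reserved for MMP writes. */
static bool
ub_is_mmp_slot(uint64_t ashift, int i)
{
	return (i >= ub_count(ashift) - MMP_SLOTS_PER_LABEL);
}
```

Under these assumptions an ashift of 9 gives 128 slots and an ashift of 12 gives 32, of which only slot 31 would be treated as an MMP slot. The real check additionally requires a valid `ub_mmp_magic`, a nonzero `ub_mmp_delay`, and fewer than four `-u` flags.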

static char curpath[PATH_MAX];

/*
 * Iterate through the path components, recursively passing
 * current one's obj and remaining path until we find the obj
 * for the last one.
 */
static int
dump_path_impl(objset_t *os, uint64_t obj, char *name)
{
	int err;
	int header = 1;
	uint64_t child_obj;
	char *s;
	dmu_buf_t *db;
	dmu_object_info_t doi;

	if ((s = strchr(name, '/')) != NULL)
		*s = '\0';
	err = zap_lookup(os, obj, name, 8, 1, &child_obj);

	(void) strlcat(curpath, name, sizeof (curpath));

	if (err != 0) {
		(void) fprintf(stderr, "failed to lookup %s: %s\n",
		    curpath, strerror(err));
		return (err);
	}

	child_obj = ZFS_DIRENT_OBJ(child_obj);
	err = sa_buf_hold(os, child_obj, FTAG, &db);
	if (err != 0) {
		(void) fprintf(stderr,
		    "failed to get SA dbuf for obj %llu: %s\n",
		    (u_longlong_t)child_obj, strerror(err));
		return (EINVAL);
	}
	dmu_object_info_from_db(db, &doi);
	sa_buf_rele(db, FTAG);

	if (doi.doi_bonus_type != DMU_OT_SA &&
	    doi.doi_bonus_type != DMU_OT_ZNODE) {
		(void) fprintf(stderr, "invalid bonus type %d for obj %llu\n",
		    doi.doi_bonus_type, (u_longlong_t)child_obj);
		return (EINVAL);
	}

	if (dump_opt['v'] > 6) {
		(void) printf("obj=%llu %s type=%d bonustype=%d\n",
		    (u_longlong_t)child_obj, curpath, doi.doi_type,
		    doi.doi_bonus_type);
	}

	(void) strlcat(curpath, "/", sizeof (curpath));

	switch (doi.doi_type) {
	case DMU_OT_DIRECTORY_CONTENTS:
		if (s != NULL && *(s + 1) != '\0')
			return (dump_path_impl(os, child_obj, s + 1));
		/*FALLTHROUGH*/
	case DMU_OT_PLAIN_FILE_CONTENTS:
		dump_object(os, child_obj, dump_opt['v'], &header, NULL);
		return (0);
	default:
		(void) fprintf(stderr, "object %llu has non-file/directory "
		    "type %d\n", (u_longlong_t)obj, doi.doi_type);
		break;
	}

	return (EINVAL);
}
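
`dump_path_impl` peels off one path component per call: it cuts the name at the first `/`, looks that component up in the current directory object, and recurses on the remainder when the child is a directory. A minimal sketch of that recursion over a toy in-memory directory table follows. `toy_entry_t`, `toy_lookup` and `toy_resolve` are hypothetical and stand in for the ZAP lookup and the DMU object types. Unlike the real function, the sketch returns the object number instead of dumping it, and it rejects a remainder that follows a regular file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One directory entry: (parent object, name) -> child object. */
typedef struct toy_entry {
	uint64_t parent;
	const char *name;
	uint64_t child;
	bool is_dir;
} toy_entry_t;

/* Toy namespace: object 1 is the root directory. */
static const toy_entry_t toy_fs[] = {
	{ 1, "etc", 2, true },
	{ 2, "passwd", 3, false },
	{ 1, "README", 4, false },
};

static const toy_entry_t *
toy_lookup(uint64_t parent, const char *name)
{
	for (size_t i = 0; i < sizeof (toy_fs) / sizeof (toy_fs[0]); i++)
		if (toy_fs[i].parent == parent &&
		    strcmp(toy_fs[i].name, name) == 0)
			return (&toy_fs[i]);
	return (NULL);
}

/*
 * Resolve a relative path below obj. The buffer is modified in place
 * (each '/' reached is overwritten with '\0'), as in dump_path_impl.
 * Returns the object number of the last component, or 0 on failure.
 */
static uint64_t
toy_resolve(uint64_t obj, char *name)
{
	const toy_entry_t *e;
	char *s;

	assert(name != NULL);
	s = strchr(name, '/');
	if (s != NULL)
		*s = '\0';
	e = toy_lookup(obj, name);
	if (e == NULL)
		return (0);
	/* Recurse only when a non-empty remainder follows a directory. */
	if (s != NULL && *(s + 1) != '\0') {
		if (!e->is_dir)
			return (0);
		return (toy_resolve(e->child, s + 1));
	}
	return (e->child);
}
```

Because the path buffer is overwritten while walking, callers need to pass a writable copy and not a string literal.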

/*
 * Dump the blocks for the object specified by path inside the dataset.
 */
static int
dump_path(char *ds, char *path)
{
	int err;
	objset_t *os;
	uint64_t root_obj;

	err = open_objset(ds, DMU_OST_ZFS, FTAG, &os);
	if (err != 0)
		return (err);

	err = zap_lookup(os, MASTER_NODE_OBJ, ZFS_ROOT_OBJ, 8, 1, &root_obj);
	if (err != 0) {
		(void) fprintf(stderr, "can't lookup root znode: %s\n",
		    strerror(err));
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
		dmu_objset_disown(os, B_FALSE, FTAG);
		return (EINVAL);
	}

	(void) snprintf(curpath, sizeof (curpath), "dataset=%s path=/", ds);

	err = dump_path_impl(os, root_obj, path);

	close_objset(os, FTAG);
	return (err);
}

static int
dump_label(const char *dev)
{
	char path[MAXPATHLEN];
	label_t labels[VDEV_LABELS];
	uint64_t psize, ashift;
	struct stat64 statbuf;
	boolean_t config_found = B_FALSE;
	boolean_t error = B_FALSE;
	avl_tree_t config_tree;
	avl_tree_t uberblock_tree;
	void *node, *cookie;
	int fd;

	bzero(labels, sizeof (labels));

	/*
	 * Check if we were given absolute path and use it as is.
	 * Otherwise if the provided vdev name doesn't point to a file,
	 * try prepending expected disk paths and partition numbers.
	 */
	(void) strlcpy(path, dev, sizeof (path));
	if (dev[0] != '/' && stat64(path, &statbuf) != 0) {
		int error;

		error = zfs_resolve_shortname(dev, path, MAXPATHLEN);
		if (error == 0 && zfs_dev_is_whole_disk(path)) {
			if (zfs_append_partition(path, MAXPATHLEN) == -1)
				error = ENOENT;
		}

		if (error || (stat64(path, &statbuf) != 0)) {
			(void) printf("failed to find device %s, try "
			    "specifying absolute path instead\n", dev);
			return (1);
		}
	}

	if ((fd = open64(path, O_RDONLY)) < 0) {
		(void) printf("cannot open '%s': %s\n", path, strerror(errno));
		exit(1);
	}

	if (fstat64_blk(fd, &statbuf) != 0) {
		(void) printf("failed to stat '%s': %s\n", path,
		    strerror(errno));
		(void) close(fd);
		exit(1);
	}

	if (S_ISBLK(statbuf.st_mode) && ioctl(fd, BLKFLSBUF) != 0)
		(void) printf("failed to invalidate cache '%s' : %s\n", path,
		    strerror(errno));

Dump unique configurations and Uberblocks in zdb -lu
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (i.e. the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
    -l     Dump each unique configuration only once.
           Indicate which labels it appears in.
    -ll    In addition, dump label space usage stats.
    -lll   Dump every configuration, unique or not.
    -u     Dump each unique, valid uberblock only once.
           Indicate which labels it appears in.
    -uu    In addition, state which slots are invalid.
    -uuu   Dump every uberblock, unique or not.
    -uuuu  Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5738
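The checksum-based duplicate detection described in this commit message can be sketched in isolation. The following is a minimal illustration, not the zdb implementation: the surrounding code keeps its records in AVL trees via cksum_record_insert(), whereas this sketch swaps in a small linear table for brevity. The names NLABELS, cksum_t, seen_t, fletcher4() and record_insert() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	NLABELS	4	/* assumed: four label copies, numbered 0-3 */

/* Four running sums, in the style of a Fletcher-4 checksum. */
typedef struct cksum {
	uint64_t word[4];
} cksum_t;

/* One distinct checksum plus the labels it was seen in (hypothetical). */
typedef struct seen {
	cksum_t cksum;
	int in_label[NLABELS];
} seen_t;

/* Fletcher-4 style sums over native 32-bit words; size must be 4-aligned. */
static void
fletcher4(const void *buf, size_t size, cksum_t *out)
{
	const unsigned char *p = buf;
	uint64_t a = 0, b = 0, c = 0, d = 0;

	assert(size % sizeof (uint32_t) == 0);
	for (size_t off = 0; off < size; off += sizeof (uint32_t)) {
		uint32_t w;

		(void) memcpy(&w, p + off, sizeof (w));
		a += w;
		b += a;
		c += b;
		d += c;
	}
	out->word[0] = a;
	out->word[1] = b;
	out->word[2] = c;
	out->word[3] = d;
}

/*
 * Note that `label` holds data whose checksum is `ck`. Returns the index
 * of the matching table entry, appending a new entry when none matches.
 * The caller must guarantee room for one more entry.
 */
static size_t
record_insert(seen_t *table, size_t *count, const cksum_t *ck, int label)
{
	size_t i;

	for (i = 0; i < *count; i++) {
		if (memcmp(&table[i].cksum, ck, sizeof (*ck)) == 0)
			break;
	}
	if (i == *count) {
		(void) memset(&table[i], 0, sizeof (table[i]));
		table[i].cksum = *ck;
		(*count)++;
	}
	table[i].in_label[label] = 1;
	return (i);
}
```

Two labels whose buffers produce the same checksum end up sharing one table entry, so a dumper can print that entry once and list every label flagged in in_label, which is the behaviour the "labels = 1 2 3" lines in the example output reflect.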
2017-03-07 03:01:45 +03:00
|
|
|
avl_create(&config_tree, cksum_record_compare,
|
|
|
|
sizeof (cksum_record_t), offsetof(cksum_record_t, link));
|
|
|
|
avl_create(&uberblock_tree, cksum_record_compare,
|
|
|
|
sizeof (cksum_record_t), offsetof(cksum_record_t, link));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
psize = statbuf.st_size;
|
|
|
|
psize = P2ALIGN(psize, (uint64_t)sizeof (vdev_label_t));
|
2017-03-07 03:01:45 +03:00
|
|
|
ashift = SPA_MINBLOCKSHIFT;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-07 03:01:45 +03:00
|
|
|
/*
|
|
|
|
* 1. Read the label from disk
|
|
|
|
* 2. Unpack the configuration and insert in config tree.
|
|
|
|
* 3. Traverse all uberblocks and insert in uberblock tree.
|
|
|
|
*/
|
|
|
|
for (int l = 0; l < VDEV_LABELS; l++) {
|
|
|
|
label_t *label = &labels[l];
|
|
|
|
char *buf = label->label.vl_vdev_phys.vp_nvlist;
|
|
|
|
size_t buflen = sizeof (label->label.vl_vdev_phys.vp_nvlist);
|
|
|
|
nvlist_t *config;
|
|
|
|
cksum_record_t *rec;
|
|
|
|
zio_cksum_t cksum;
|
|
|
|
vdev_t vd;
|
|
|
|
|
|
|
|
if (pread64(fd, &label->label, sizeof (label->label),
|
|
|
|
vdev_label_offset(psize, l, 0)) != sizeof (label->label)) {
|
2017-02-04 01:18:28 +03:00
|
|
|
if (!dump_opt['q'])
|
|
|
|
(void) printf("failed to read label %d\n", l);
|
2017-03-07 03:01:45 +03:00
|
|
|
label->read_failed = B_TRUE;
|
|
|
|
error = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-03-07 03:01:45 +03:00
|
|
|
label->read_failed = B_FALSE;
|
|
|
|
|
|
|
|
if (nvlist_unpack(buf, buflen, &config, 0) == 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *vdev_tree = NULL;
|
2017-03-07 03:01:45 +03:00
|
|
|
size_t size;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if ((nvlist_lookup_nvlist(config,
|
|
|
|
ZPOOL_CONFIG_VDEV_TREE, &vdev_tree) != 0) ||
|
|
|
|
(nvlist_lookup_uint64(vdev_tree,
|
|
|
|
ZPOOL_CONFIG_ASHIFT, &ashift) != 0))
|
|
|
|
ashift = SPA_MINBLOCKSHIFT;
|
2017-03-07 03:01:45 +03:00
|
|
|
|
|
|
|
if (nvlist_size(config, &size, NV_ENCODE_XDR) != 0)
|
|
|
|
size = buflen;
|
|
|
|
|
|
|
|
fletcher_4_native_varsize(buf, size, &cksum);
|
|
|
|
rec = cksum_record_insert(&config_tree, &cksum, l);
|
|
|
|
|
|
|
|
label->config = rec;
|
|
|
|
label->config_nv = config;
|
|
|
|
config_found = B_TRUE;
|
|
|
|
} else {
|
|
|
|
error = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2017-03-07 03:01:45 +03:00
|
|
|
|
|
|
|
vd.vdev_ashift = ashift;
|
|
|
|
vd.vdev_top = &vd;
|
|
|
|
|
|
|
|
for (int i = 0; i < VDEV_UBERBLOCK_COUNT(&vd); i++) {
|
|
|
|
uint64_t uoff = VDEV_UBERBLOCK_OFFSET(&vd, i);
|
|
|
|
uberblock_t *ub = (void *)((char *)label + uoff);
|
|
|
|
|
|
|
|
if (uberblock_verify(ub))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
fletcher_4_native_varsize(ub, sizeof (*ub), &cksum);
|
|
|
|
rec = cksum_record_insert(&uberblock_tree, &cksum, l);
|
|
|
|
|
|
|
|
label->uberblocks[i] = rec;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Dump the label and uberblocks.
|
|
|
|
*/
|
|
|
|
for (int l = 0; l < VDEV_LABELS; l++) {
|
|
|
|
label_t *label = &labels[l];
|
|
|
|
size_t buflen = sizeof (label->label.vl_vdev_phys.vp_nvlist);
|
|
|
|
|
|
|
|
if (label->read_failed == B_TRUE)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (label->config_nv) {
|
|
|
|
dump_config_from_label(label, buflen, l);
|
|
|
|
} else {
|
|
|
|
if (!dump_opt['q'])
|
|
|
|
(void) printf("failed to unpack label %d\n", l);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['u'])
|
2017-03-07 03:01:45 +03:00
|
|
|
dump_label_uberblocks(label, ashift, l);
|
|
|
|
|
|
|
|
nvlist_free(label->config_nv);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-03-07 03:01:45 +03:00
|
|
|
cookie = NULL;
|
|
|
|
while ((node = avl_destroy_nodes(&config_tree, &cookie)) != NULL)
|
|
|
|
umem_free(node, sizeof (cksum_record_t));
|
|
|
|
|
|
|
|
cookie = NULL;
|
|
|
|
while ((node = avl_destroy_nodes(&uberblock_tree, &cookie)) != NULL)
|
|
|
|
umem_free(node, sizeof (cksum_record_t));
|
|
|
|
|
|
|
|
avl_destroy(&config_tree);
|
|
|
|
avl_destroy(&uberblock_tree);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) close(fd);
|
2017-02-04 01:18:28 +03:00
|
|
|
|
Dump unique configurations and Uberblocks in zdb -lu
2017-03-07 03:01:45 +03:00
|
|
|
return (config_found == B_FALSE ? 2 :
|
|
|
|
(error == B_TRUE ? 1 : 0));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
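The return expression above folds two flags into one exit status. A minimal sketch of that mapping as a standalone function follows; `label_exit_status` is a hypothetical name used only for illustration and is not part of zdb.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of zdb): mirrors the nested conditional
 * in the return statement above.
 */
static int
label_exit_status(bool config_found, bool error)
{
	if (!config_found)
		return (2);	/* config_found was never set */
	return (error ? 1 : 0);	/* 1 if any step reported an error */
}
```

A missing configuration takes precedence: the status is 2 whether or not the error flag is also set.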
|
|
|
|
|
2015-07-24 19:53:55 +03:00
|
|
|
static uint64_t dataset_feature_count[SPA_FEATURES];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
2016-09-22 19:30:13 +03:00
|
|
|
static uint64_t remap_deadlist_count = 0;
|
2014-11-03 23:15:08 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*ARGSUSED*/
|
|
|
|
static int
|
2010-05-29 00:45:14 +04:00
|
|
|
dump_one_dir(const char *dsname, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
objset_t *os;
|
2015-07-24 19:53:55 +03:00
|
|
|
spa_feature_t f;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
error = open_objset(dsname, DMU_OST_ANY, FTAG, &os);
|
|
|
|
if (error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
2015-07-24 19:53:55 +03:00
|
|
|
|
|
|
|
for (f = 0; f < SPA_FEATURES; f++) {
|
|
|
|
if (!dmu_objset_ds(os)->ds_feature_inuse[f])
|
|
|
|
continue;
|
|
|
|
ASSERT(spa_feature_table[f].fi_flags &
|
|
|
|
ZFEATURE_FLAG_PER_DATASET);
|
|
|
|
dataset_feature_count[f]++;
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
2016-09-22 19:30:13 +03:00
|
|
|
if (dsl_dataset_remap_deadlist_exists(dmu_objset_ds(os))) {
|
|
|
|
remap_deadlist_count++;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_dir(os);
|
2017-04-13 19:40:56 +03:00
|
|
|
close_objset(os, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
fuid_table_destroy();
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Block statistics.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-11-03 23:15:08 +03:00
|
|
|
#define PSIZE_HISTO_SIZE (SPA_OLD_MAXBLOCKSIZE / SPA_MINBLOCKSIZE + 2)
|
2008-11-20 23:01:55 +03:00
|
|
|
typedef struct zdb_blkstats {
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zb_asize;
|
|
|
|
uint64_t zb_lsize;
|
|
|
|
uint64_t zb_psize;
|
|
|
|
uint64_t zb_count;
|
2014-11-03 22:12:40 +03:00
|
|
|
uint64_t zb_gangs;
|
|
|
|
uint64_t zb_ditto_samevdev;
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zb_psize_histogram[PSIZE_HISTO_SIZE];
|
2008-11-20 23:01:55 +03:00
|
|
|
} zdb_blkstats_t;
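PSIZE_HISTO_SIZE reserves one slot per SPA_MINBLOCKSIZE multiple up to SPA_OLD_MAXBLOCKSIZE, plus a final catch-all slot. The sketch below shows how a physical size could be mapped to a slot in such a histogram. It assumes the usual ZFS values of 512 bytes and 128 KiB for the two constants; the `SKETCH_*` names and `psize_histo_index` are hypothetical.

```c
#include <stdint.h>

/* Assumed values (512 B minimum, 128 KiB old maximum); not taken from this file. */
#define	SKETCH_MINBLOCKSHIFT	9
#define	SKETCH_MINBLOCKSIZE	(1ULL << SKETCH_MINBLOCKSHIFT)
#define	SKETCH_OLD_MAXBLOCKSIZE	(1ULL << 17)
#define	SKETCH_HISTO_SIZE	\
	(SKETCH_OLD_MAXBLOCKSIZE / SKETCH_MINBLOCKSIZE + 2)

/*
 * Map a physical block size to a histogram slot. Sizes beyond the old
 * maximum are clamped into the last slot.
 */
static unsigned
psize_histo_index(uint64_t psize)
{
	uint64_t idx = psize >> SKETCH_MINBLOCKSHIFT;
	uint64_t last = SKETCH_HISTO_SIZE - 1;

	return ((unsigned)(idx < last ? idx : last));
}
```

Under these assumptions the histogram has 258 slots: indices 0 through 256 cover sizes up to 128 KiB, and index 257 collects everything larger.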
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Extended object types to report deferred frees and dedup auto-ditto blocks.
|
|
|
|
*/
|
|
|
|
#define ZDB_OT_DEFERRED (DMU_OT_NUMTYPES + 0)
|
|
|
|
#define ZDB_OT_DITTO (DMU_OT_NUMTYPES + 1)
|
2012-12-14 03:24:15 +04:00
|
|
|
#define ZDB_OT_OTHER (DMU_OT_NUMTYPES + 2)
|
|
|
|
#define ZDB_OT_TOTAL (DMU_OT_NUMTYPES + 3)
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static const char *zdb_ot_extname[] = {
|
2010-05-29 00:45:14 +04:00
|
|
|
"deferred free",
|
|
|
|
"dedup ditto",
|
2012-12-14 03:24:15 +04:00
|
|
|
"other",
|
2010-05-29 00:45:14 +04:00
|
|
|
"Total",
|
|
|
|
};
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
#define ZB_TOTAL DN_MAX_LEVELS
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
typedef struct zdb_cb {
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_blkstats_t zcb_type[ZB_TOTAL + 1][ZDB_OT_TOTAL + 1];
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
2016-09-22 19:30:13 +03:00
|
|
|
uint64_t zcb_removing_size;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t zcb_dedup_asize;
|
|
|
|
uint64_t zcb_dedup_blocks;
|
2014-06-06 01:19:08 +04:00
|
|
|
uint64_t zcb_embedded_blocks[NUM_BP_EMBEDDED_TYPES];
|
|
|
|
uint64_t zcb_embedded_histogram[NUM_BP_EMBEDDED_TYPES]
|
2016-08-04 17:23:35 +03:00
|
|
|
[BPE_PAYLOAD_SIZE + 1];
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zcb_start;
|
2017-10-27 22:46:35 +03:00
|
|
|
hrtime_t zcb_lastprint;
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t zcb_totalasize;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t zcb_errors[256];
|
|
|
|
int zcb_readfails;
|
|
|
|
int zcb_haderrors;
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *zcb_spa;
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
2016-09-22 19:30:13 +03:00
|
|
|
uint32_t **zcb_vd_obsolete_counts;
|
2008-11-20 23:01:55 +03:00
|
|
|
} zdb_cb_t;
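The zcb_type table has one extra row and one extra column, so each block can be counted per level, per type, and in the running totals. The sketch below illustrates that four-way roll-up with a reduced table. The dimensions, `sketch_stats_t`, and `sketch_count_block` are hypothetical stand-ins, not the zdb definitions.

```c
#include <stdint.h>

#define	SKETCH_LEVELS	4	/* hypothetical; stands in for ZB_TOTAL */
#define	SKETCH_TYPES	3	/* hypothetical; stands in for ZDB_OT_TOTAL */

typedef struct sketch_stats {
	uint64_t ss_count;
	uint64_t ss_asize;
} sketch_stats_t;

/*
 * Account one block in four cells: [level][total], [level][type],
 * [total][total] and [total][type].
 */
static void
sketch_count_block(sketch_stats_t tbl[SKETCH_LEVELS + 1][SKETCH_TYPES + 1],
    int level, int type, uint64_t asize)
{
	for (int i = 0; i < 4; i++) {
		int l = (i < 2) ? level : SKETCH_LEVELS;
		int t = (i & 1) ? type : SKETCH_TYPES;

		tbl[l][t].ss_count++;
		tbl[l][t].ss_asize += asize;
	}
}
```

Because every block updates the totals row and column as it is counted, reporting code can read the totals directly instead of summing the table afterwards.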
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_count_block(zdb_cb_t *zcb, zilog_t *zilog, const blkptr_t *bp,
|
|
|
|
dmu_object_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t refcnt = 0;
|
2010-08-26 20:52:39 +04:00
|
|
|
int i;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
ASSERT(type < ZDB_OT_TOTAL);
|
|
|
|
|
|
|
|
if (zilog && zil_bp_tree_add(zilog, bp) != 0)
|
|
|
|
return;
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (i = 0; i < 4; i++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
int l = (i < 2) ? BP_GET_LEVEL(bp) : ZB_TOTAL;
|
2010-05-29 00:45:14 +04:00
|
|
|
int t = (i & 1) ? type : ZDB_OT_TOTAL;
|
2014-11-03 22:12:40 +03:00
|
|
|
int equal;
|
2008-11-20 23:01:55 +03:00
|
|
|
zdb_blkstats_t *zb = &zcb->zcb_type[l][t];
|
|
|
|
|
|
|
|
zb->zb_asize += BP_GET_ASIZE(bp);
|
|
|
|
zb->zb_lsize += BP_GET_LSIZE(bp);
|
|
|
|
zb->zb_psize += BP_GET_PSIZE(bp);
|
|
|
|
zb->zb_count++;
|
2014-11-03 23:15:08 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The histogram is only big enough to record blocks up to
|
|
|
|
* SPA_OLD_MAXBLOCKSIZE; larger blocks go into the last,
|
|
|
|
* "other", bucket.
|
|
|
|
*/
|
2017-10-27 22:46:35 +03:00
|
|
|
unsigned idx = BP_GET_PSIZE(bp) >> SPA_MINBLOCKSHIFT;
|
2014-11-03 23:15:08 +03:00
|
|
|
idx = MIN(idx, SPA_OLD_MAXBLOCKSIZE / SPA_MINBLOCKSIZE + 1);
|
|
|
|
zb->zb_psize_histogram[idx]++;
|
2014-11-03 22:12:40 +03:00
|
|
|
|
|
|
|
zb->zb_gangs += BP_COUNT_GANG(bp);
|
|
|
|
|
|
|
|
switch (BP_GET_NDVAS(bp)) {
|
|
|
|
case 2:
|
|
|
|
if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1]))
|
|
|
|
zb->zb_ditto_samevdev++;
|
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[1])) +
|
|
|
|
(DVA_GET_VDEV(&bp->blk_dva[0]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2])) +
|
|
|
|
(DVA_GET_VDEV(&bp->blk_dva[1]) ==
|
|
|
|
DVA_GET_VDEV(&bp->blk_dva[2]));
|
|
|
|
if (equal != 0)
|
|
|
|
zb->zb_ditto_samevdev++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_EMBEDDED(bp)) {
|
|
|
|
zcb->zcb_embedded_blocks[BPE_GET_ETYPE(bp)]++;
|
|
|
|
zcb->zcb_embedded_histogram[BPE_GET_ETYPE(bp)]
|
|
|
|
[BPE_GET_PSIZE(bp)]++;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['L'])
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (BP_GET_DEDUP(bp)) {
|
|
|
|
ddt_t *ddt;
|
|
|
|
ddt_entry_t *dde;
|
|
|
|
|
|
|
|
ddt = ddt_select(zcb->zcb_spa, bp);
|
|
|
|
ddt_enter(ddt);
|
|
|
|
dde = ddt_lookup(ddt, bp, B_FALSE);
|
|
|
|
|
|
|
|
if (dde == NULL) {
|
|
|
|
refcnt = 0;
|
|
|
|
} else {
|
|
|
|
ddt_phys_t *ddp = ddt_phys_select(dde, bp);
|
|
|
|
ddt_phys_decref(ddp);
|
|
|
|
refcnt = ddp->ddp_refcnt;
|
|
|
|
if (ddt_phys_total_refcnt(dde) == 0)
|
|
|
|
ddt_remove(ddt, dde);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
ddt_exit(ddt);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
VERIFY3U(zio_wait(zio_claim(NULL, zcb->zcb_spa,
|
|
|
|
refcnt ? 0 : spa_first_txg(zcb->zcb_spa),
|
|
|
|
bp, NULL, NULL, ZIO_FLAG_CANFAIL)), ==, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
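The switch on BP_GET_NDVAS() in zdb_count_block() counts a block once if any two of its copies live on the same top-level vdev. A minimal generalization of that pairwise check is sketched below; `dvas_share_vdev` is a hypothetical helper that takes plain vdev ids instead of a block pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if at least two of the ndvas vdev ids are equal. This is
 * the same condition the two- and three-DVA cases above test by hand.
 */
static bool
dvas_share_vdev(const uint64_t *vdevs, int ndvas)
{
	for (int i = 0; i < ndvas; i++) {
		for (int j = i + 1; j < ndvas; j++) {
			if (vdevs[i] == vdevs[j])
				return (true);
		}
	}
	return (false);
}
```

With a single DVA the inner loop never runs, so the function returns false, which matches the switch above having no case for one copy.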
|
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
static void
|
|
|
|
zdb_blkptr_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
spa_t *spa = zio->io_spa;
|
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
int ioerr = zio->io_error;
|
|
|
|
zdb_cb_t *zcb = zio->io_private;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t *zb = &zio->io_bookmark;
|
2013-05-03 03:36:32 +04:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_free(zio->io_abd);
|
2013-05-03 03:36:32 +04:00
|
|
|
|
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2017-11-16 04:27:01 +03:00
|
|
|
spa->spa_load_verify_ios--;
|
2013-05-03 03:36:32 +04:00
|
|
|
cv_broadcast(&spa->spa_scrub_io_cv);
|
|
|
|
|
|
|
|
if (ioerr && !(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
|
|
|
|
zcb->zcb_haderrors = 1;
|
|
|
|
zcb->zcb_errors[ioerr]++;
|
|
|
|
|
|
|
|
if (dump_opt['b'] >= 2)
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2013-05-03 03:36:32 +04:00
|
|
|
else
|
|
|
|
blkbuf[0] = '\0';
|
|
|
|
|
|
|
|
(void) printf("zdb_blkptr_cb: "
|
|
|
|
"Got error %d reading "
|
|
|
|
"<%llu, %llu, %lld, %llx> %s -- skipping\n",
|
|
|
|
ioerr,
|
|
|
|
(u_longlong_t)zb->zb_objset,
|
|
|
|
(u_longlong_t)zb->zb_object,
|
|
|
|
(longlong_t)zb->zb_level,
|
|
|
|
(u_longlong_t)zb->zb_blkid,
|
|
|
|
blkbuf);
|
|
|
|
}
|
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
2013-07-03 00:26:24 +04:00
|
|
|
zdb_blkptr_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
|
2014-06-25 22:37:59 +04:00
|
|
|
const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
zdb_cb_t *zcb = arg;
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_object_type_t type;
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t is_metadata;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-12-22 04:31:57 +03:00
|
|
|
if (bp == NULL)
|
|
|
|
return (0);
|
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
if (dump_opt['b'] >= 5 && bp->blk_birth > 0) {
|
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
|
|
|
(void) printf("objset %llu object %llu "
|
|
|
|
"level %lld offset 0x%llx %s\n",
|
|
|
|
(u_longlong_t)zb->zb_objset,
|
|
|
|
(u_longlong_t)zb->zb_object,
|
|
|
|
(longlong_t)zb->zb_level,
|
|
|
|
(u_longlong_t)blkid2offset(dnp, bp, zb),
|
|
|
|
blkbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (BP_IS_HOLE(bp))
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
type = BP_GET_TYPE(bp);
|
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
zdb_count_block(zcb, zilog, bp,
|
|
|
|
(type & DMU_OT_NEWTYPE) ? ZDB_OT_OTHER : type);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2012-12-14 03:24:15 +04:00
|
|
|
is_metadata = (BP_GET_LEVEL(bp) != 0 || DMU_OT_IS_METADATA(type));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp) &&
|
|
|
|
(dump_opt['c'] > 1 || (dump_opt['c'] && is_metadata))) {
|
2010-05-29 00:45:14 +04:00
|
|
|
size_t size = BP_GET_PSIZE(bp);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *abd = abd_alloc(size, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
int flags = ZIO_FLAG_CANFAIL | ZIO_FLAG_SCRUB | ZIO_FLAG_RAW;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* If it's an intent log block, failure is expected. */
|
|
|
|
if (zb->zb_level == ZB_ZIL_LEVEL)
|
|
|
|
flags |= ZIO_FLAG_SPECULATIVE;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
mutex_enter(&spa->spa_scrub_lock);
|
2017-11-16 04:27:01 +03:00
|
|
|
while (spa->spa_load_verify_ios > max_inflight)
|
2013-05-03 03:36:32 +04:00
|
|
|
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
|
2017-11-16 04:27:01 +03:00
|
|
|
spa->spa_load_verify_ios++;
|
2013-05-03 03:36:32 +04:00
|
|
|
mutex_exit(&spa->spa_scrub_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_read(NULL, spa, bp, abd, size,
|
2013-05-03 03:36:32 +04:00
|
|
|
zdb_blkptr_done, zcb, ZIO_PRIORITY_ASYNC_READ, flags, zb));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
zcb->zcb_readfails = 0;
|
|
|
|
|
2015-05-15 02:41:29 +03:00
|
|
|
/* only call gethrtime() every 100 blocks */
|
|
|
|
static int iters;
|
|
|
|
if (++iters > 100)
|
|
|
|
iters = 0;
|
|
|
|
else
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (dump_opt['b'] < 5 && gethrtime() > zcb->zcb_lastprint + NANOSEC) {
|
2013-03-25 01:24:51 +04:00
|
|
|
uint64_t now = gethrtime();
|
|
|
|
char buf[10];
|
|
|
|
uint64_t bytes = zcb->zcb_type[ZB_TOTAL][ZDB_OT_TOTAL].zb_asize;
|
|
|
|
int kb_per_sec =
|
|
|
|
1 + bytes / (1 + ((now - zcb->zcb_start) / 1000 / 1000));
|
|
|
|
int sec_remaining =
|
|
|
|
(zcb->zcb_totalasize - bytes) / 1024 / kb_per_sec;
|
|
|
|
|
2017-06-13 12:16:45 +03:00
|
|
|
/* make sure nicenum has enough space */
|
|
|
|
CTASSERT(sizeof (buf) >= NN_NUMBUF_SZ);
|
|
|
|
|
2017-05-02 23:43:53 +03:00
|
|
|
zfs_nicebytes(bytes, buf, sizeof (buf));
|
2013-03-25 01:24:51 +04:00
|
|
|
(void) fprintf(stderr,
|
|
|
|
"\r%5s completed (%4dMB/s) "
|
|
|
|
"estimated time remaining: %uhr %02umin %02usec ",
|
|
|
|
buf, kb_per_sec / 1024,
|
|
|
|
sec_remaining / 60 / 60,
|
|
|
|
sec_remaining / 60 % 60,
|
|
|
|
sec_remaining % 60);
|
|
|
|
|
|
|
|
zcb->zcb_lastprint = now;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
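The progress line printed by zdb_blkptr_cb() rests on a small piece of integer arithmetic, which the following minimal sketch isolates. The names `progress_est_t` and `estimate_progress()` are hypothetical helpers introduced for illustration only and do not exist in zdb; the formulas are copied from the function above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, for illustration only; not part of zdb. */
typedef struct progress_est {
	int pe_kb_per_sec;	/* approximate throughput */
	int pe_sec_remaining;	/* estimated seconds left */
} progress_est_t;

/*
 * Mirrors the arithmetic in zdb_blkptr_cb(): bytes per elapsed
 * millisecond is roughly KB per second, and both "+ 1" terms keep
 * the divisions away from zero at the very start of a traversal.
 */
static progress_est_t
estimate_progress(uint64_t bytes_done, uint64_t bytes_total,
    uint64_t elapsed_ns)
{
	progress_est_t est;
	uint64_t elapsed_ms = elapsed_ns / 1000 / 1000;

	assert(bytes_done <= bytes_total);

	est.pe_kb_per_sec = (int)(1 + bytes_done / (1 + elapsed_ms));
	est.pe_sec_remaining =
	    (int)((bytes_total - bytes_done) / 1024 / est.pe_kb_per_sec);
	return (est);
}
```

The seconds value is then split for display as `sec / 60 / 60`, `sec / 60 % 60` and `sec % 60`, as in the fprintf() call above.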
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
zdb_leak(void *arg, uint64_t start, uint64_t size)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2013-10-02 01:25:53 +04:00
|
|
|
vdev_t *vd = arg;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
(void) printf("leaked space: vdev %llu, offset 0x%llx, size %llu\n",
|
|
|
|
(u_longlong_t)vd->vdev_id, (u_longlong_t)start, (u_longlong_t)size);
|
|
|
|
}
|
|
|
|
|
2013-10-02 01:25:53 +04:00
|
|
|
static metaslab_ops_t zdb_metaslab_ops = {
|
2014-07-20 00:19:24 +04:00
|
|
|
NULL /* alloc */
|
2010-05-29 00:45:14 +04:00
|
|
|
};
|
|
|
|
|
2016-09-22 19:30:13 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
claim_segment_impl_cb(uint64_t inner_offset, vdev_t *vd, uint64_t offset,
|
|
|
|
uint64_t size, void *arg)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This callback was called through a remap from
|
|
|
|
* a device being removed. Therefore, the vdev that
|
|
|
|
* this callback is applied to is a concrete
|
|
|
|
* vdev.
|
|
|
|
*/
|
|
|
|
ASSERT(vdev_is_concrete(vd));
|
|
|
|
|
|
|
|
VERIFY0(metaslab_claim_impl(vd, offset, size,
|
|
|
|
spa_first_txg(vd->vdev_spa)));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
claim_segment_cb(void *arg, uint64_t offset, uint64_t size)
|
|
|
|
{
|
|
|
|
vdev_t *vd = arg;
|
|
|
|
|
|
|
|
vdev_indirect_ops.vdev_op_remap(vd, offset, size,
|
|
|
|
claim_segment_impl_cb, NULL);
|
|
|
|
}
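The two callbacks above rely on a remap step that translates a range on the removed vdev into one or more segments on concrete vdevs and invokes a callback per segment. The sketch below illustrates that pattern under loud assumptions: `map_entry_t`, `remap_range()`, `claim_tally_t` and `tally_cb()` are hypothetical stand-ins, the mapping is a flat array, and none of this is the real vdev_indirect_ops.vdev_op_remap implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mapping entry: [me_src, me_src + me_size) on the
 * removed vdev now lives at me_dst on vdev me_dst_vdev.
 */
typedef struct map_entry {
	uint64_t me_src;
	uint64_t me_size;
	uint64_t me_dst_vdev;
	uint64_t me_dst;
} map_entry_t;

typedef void remap_cb_t(uint64_t inner_offset, uint64_t dst_vdev,
    uint64_t dst_offset, uint64_t size, void *arg);

/* Invoke cb once for each mapped piece overlapping [offset, offset + size). */
static void
remap_range(const map_entry_t *map, size_t n, uint64_t offset,
    uint64_t size, remap_cb_t *cb, void *arg)
{
	uint64_t end = offset + size;

	assert(cb != NULL);
	for (size_t i = 0; i < n; i++) {
		uint64_t lo = map[i].me_src;
		uint64_t hi = lo + map[i].me_size;
		uint64_t s = (offset > lo) ? offset : lo;
		uint64_t e = (end < hi) ? end : hi;

		if (s >= e)
			continue;	/* no overlap with this entry */
		cb(s - offset, map[i].me_dst_vdev,
		    map[i].me_dst + (s - lo), e - s, arg);
	}
}

/* Hypothetical accumulator standing in for the real claim logic. */
typedef struct claim_tally {
	uint64_t ct_segs;
	uint64_t ct_bytes;
} claim_tally_t;

static void
tally_cb(uint64_t inner_offset, uint64_t dst_vdev, uint64_t dst_offset,
    uint64_t size, void *arg)
{
	claim_tally_t *ct = arg;

	(void) inner_offset;
	(void) dst_vdev;
	(void) dst_offset;
	ct->ct_segs++;
	ct->ct_bytes += size;
}
```

In this sketch a range that straddles two mapping entries yields two callback invocations, which is why the claim callback receives a destination vdev and offset for each piece instead of the original range.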
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After accounting for all allocated blocks that are directly referenced,
|
|
|
|
* we might have missed a reference to a block from a partially complete
|
|
|
|
* (and thus unused) indirect mapping object. We perform a secondary pass
|
|
|
|
* through the metaslabs we have already mapped and claim the destination
|
|
|
|
* blocks.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
zdb_claim_removing(spa_t *spa, zdb_cb_t *zcb)
|
|
|
|
{
|
|
|
|
if (spa->spa_vdev_removal == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
|
|
|
spa_vdev_removal_t *svr = spa->spa_vdev_removal;
|
OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-02-13 22:37:56 +03:00
|
|
|
vdev_t *vd = vdev_lookup_top(spa, svr->svr_vdev_id);
|
2016-09-22 19:30:13 +03:00
|
|
|
vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
|
|
|
|
|
|
|
|
for (uint64_t msi = 0; msi < vd->vdev_ms_count; msi++) {
|
|
|
|
metaslab_t *msp = vd->vdev_ms[msi];
|
|
|
|
|
|
|
|
if (msp->ms_start >= vdev_indirect_mapping_max_offset(vim))
|
|
|
|
break;
|
|
|
|
|
|
|
|
ASSERT0(range_tree_space(svr->svr_allocd_segs));
|
|
|
|
|
|
|
|
if (msp->ms_sm != NULL) {
|
|
|
|
VERIFY0(space_map_load(msp->ms_sm,
|
|
|
|
svr->svr_allocd_segs, SM_ALLOC));
|
|
|
|
|
|
|
|
/*
|
2018-02-13 22:37:56 +03:00
|
|
|
* Clear everything past what has been synced unless
|
|
|
|
* it's past the spacemap, because we have not allocated
|
|
|
|
* mappings for it yet.
|
			 */
			uint64_t vim_max_offset =
			    vdev_indirect_mapping_max_offset(vim);
			uint64_t sm_end = msp->ms_sm->sm_start +
			    msp->ms_sm->sm_size;
			if (sm_end > vim_max_offset)
				range_tree_clear(svr->svr_allocd_segs,
				    vim_max_offset, sm_end - vim_max_offset);
		}

		zcb->zcb_removing_size +=
		    range_tree_space(svr->svr_allocd_segs);
		range_tree_vacate(svr->svr_allocd_segs, claim_segment_cb, vd);
	}

	spa_config_exit(spa, SCL_CONFIG, FTAG);
}
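
The clipping step in the function above can be looked at in isolation. The following is a minimal sketch, not the zdb implementation: `clip_past_synced` is a hypothetical helper, and plain integers stand in for the space map and the indirect mapping. It computes the byte range, if any, that lies between the highest offset covered by the mapping and the end of the space map, which corresponds to the range handed to range_tree_clear().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the clipping logic: given a space map
 * extent (sm_start, sm_size) and the highest offset covered by the
 * indirect mapping, report the range that has no mappings yet.
 * Returns 1 and fills in *off and *len if there is something to clear,
 * 0 otherwise (in which case *off and *len are left untouched).
 */
static int
clip_past_synced(uint64_t sm_start, uint64_t sm_size,
    uint64_t vim_max_offset, uint64_t *off, uint64_t *len)
{
	uint64_t sm_end = sm_start + sm_size;

	assert(sm_end >= sm_start);	/* no wraparound in this sketch */
	if (sm_end <= vim_max_offset)
		return (0);
	*off = vim_max_offset;
	*len = sm_end - vim_max_offset;
	return (1);
}
```

With a space map covering [0, 100) and a mapping that reaches offset 60, the helper reports offset 60 and length 40; when the mapping already reaches or passes the end of the space map, nothing is reported.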

/*
 * vim_idxp is an in-out parameter which (for indirect vdevs) is the
 * index in vim_entries that has the first entry in this metaslab. On
 * return, it will be set to the first entry after this metaslab.
 */
static void
zdb_leak_init_ms(metaslab_t *msp, uint64_t *vim_idxp)
{
	metaslab_group_t *mg = msp->ms_group;
	vdev_t *vd = mg->mg_vd;
	vdev_t *rvd = vd->vdev_spa->spa_root_vdev;

	mutex_enter(&msp->ms_lock);
	metaslab_unload(msp);

	/*
	 * We don't want to spend the CPU manipulating the size-ordered
	 * tree, so clear the range_tree ops.
	 */
	msp->ms_tree->rt_ops = NULL;

	(void) fprintf(stderr,
	    "\rloading vdev %llu of %llu, metaslab %llu of %llu ...",
	    (longlong_t)vd->vdev_id,
	    (longlong_t)rvd->vdev_children,
	    (longlong_t)msp->ms_id,
	    (longlong_t)vd->vdev_ms_count);

	/*
	 * For leak detection, we overload the metaslab ms_tree to
	 * contain allocated segments instead of free segments. As a
	 * result, we can't use the normal metaslab_load/unload
	 * interfaces.
	 */
	if (vd->vdev_ops == &vdev_indirect_ops) {
		vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
		for (; *vim_idxp < vdev_indirect_mapping_num_entries(vim);
		    (*vim_idxp)++) {
			vdev_indirect_mapping_entry_phys_t *vimep =
			    &vim->vim_entries[*vim_idxp];
			uint64_t ent_offset = DVA_MAPPING_GET_SRC_OFFSET(vimep);
			uint64_t ent_len = DVA_GET_ASIZE(&vimep->vimep_dst);
			ASSERT3U(ent_offset, >=, msp->ms_start);
			if (ent_offset >= msp->ms_start + msp->ms_size)
				break;

			/*
			 * Mappings do not cross metaslab boundaries,
			 * because we create them by walking the metaslabs.
			 */
			ASSERT3U(ent_offset + ent_len, <=,
			    msp->ms_start + msp->ms_size);
			range_tree_add(msp->ms_tree, ent_offset, ent_len);
		}
	} else if (msp->ms_sm != NULL) {
		VERIFY0(space_map_load(msp->ms_sm, msp->ms_tree, SM_ALLOC));
	}

	if (!msp->ms_loaded) {
		msp->ms_loaded = B_TRUE;
	}
	mutex_exit(&msp->ms_lock);
}
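
The in-out index used for indirect vdevs above is a resumable cursor over entries sorted by offset: each call consumes the entries that start inside one metaslab and leaves the index at the first entry of the next one. The sketch below shows that pattern on its own. It is an illustration under stated assumptions, not ZFS code: `entry_t` and `claim_entries` are hypothetical names, and the function sums entry lengths where the real code adds each entry to a range tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified mapping entry: source offset and length. */
typedef struct entry {
	uint64_t offset;
	uint64_t len;
} entry_t;

/*
 * Sum the lengths of the entries that start inside
 * [ms_start, ms_start + ms_size), beginning at index *idxp. The
 * entries must be sorted by offset. On return, *idxp is the index of
 * the first entry past this metaslab, so the next call can resume
 * from there without rescanning.
 */
static uint64_t
claim_entries(const entry_t *ents, size_t nents, uint64_t ms_start,
    uint64_t ms_size, size_t *idxp)
{
	uint64_t total = 0;

	for (; *idxp < nents; (*idxp)++) {
		const entry_t *e = &ents[*idxp];

		assert(e->offset >= ms_start);
		if (e->offset >= ms_start + ms_size)
			break;
		/* Entries are assumed not to cross the metaslab boundary. */
		assert(e->offset + e->len <= ms_start + ms_size);
		total += e->len;
	}
	return (total);
}
```

Calling the function once per metaslab, in ascending order and with the same index variable, visits every entry exactly once. That is the reason for passing the index by pointer rather than searching from the start each time.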

/* ARGSUSED */
static int
increment_indirect_mapping_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
{
	zdb_cb_t *zcb = arg;
	spa_t *spa = zcb->zcb_spa;
	vdev_t *vd;
	const dva_t *dva = &bp->blk_dva[0];

	ASSERT(!dump_opt['L']);
	ASSERT3U(BP_GET_NDVAS(bp), ==, 1);

	spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
	vd = vdev_lookup_top(zcb->zcb_spa, DVA_GET_VDEV(dva));
	ASSERT3P(vd, !=, NULL);
	spa_config_exit(spa, SCL_VDEV, FTAG);

	ASSERT(vd->vdev_indirect_config.vic_mapping_object != 0);
	ASSERT3P(zcb->zcb_vd_obsolete_counts[vd->vdev_id], !=, NULL);

	vdev_indirect_mapping_increment_obsolete_count(
	    vd->vdev_indirect_mapping,
	    DVA_GET_OFFSET(dva), DVA_GET_ASIZE(dva),
	    zcb->zcb_vd_obsolete_counts[vd->vdev_id]);

	return (0);
}

static uint32_t *
zdb_load_obsolete_counts(vdev_t *vd)
{
	vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
	spa_t *spa = vd->vdev_spa;
	spa_condensing_indirect_phys_t *scip =
	    &spa->spa_condensing_indirect_phys;
	uint32_t *counts;

	EQUIV(vdev_obsolete_sm_object(vd) != 0, vd->vdev_obsolete_sm != NULL);
	counts = vdev_indirect_mapping_load_obsolete_counts(vim);
	if (vd->vdev_obsolete_sm != NULL) {
		vdev_indirect_mapping_load_obsolete_spacemap(vim, counts,
		    vd->vdev_obsolete_sm);
	}
	if (scip->scip_vdev == vd->vdev_id &&
	    scip->scip_prev_obsolete_sm_object != 0) {
		space_map_t *prev_obsolete_sm = NULL;
		VERIFY0(space_map_open(&prev_obsolete_sm, spa->spa_meta_objset,
		    scip->scip_prev_obsolete_sm_object, 0, vd->vdev_asize, 0));
		space_map_update(prev_obsolete_sm);
		vdev_indirect_mapping_load_obsolete_spacemap(vim, counts,
		    prev_obsolete_sm);
		space_map_close(prev_obsolete_sm);
	}
	return (counts);
}

static void
zdb_ddt_leak_init(spa_t *spa, zdb_cb_t *zcb)
{
	ddt_bookmark_t ddb;
	ddt_entry_t dde;
	int error;
	int p;

	bzero(&ddb, sizeof (ddb));
	while ((error = ddt_walk(spa, &ddb, &dde)) == 0) {
		blkptr_t blk;
		ddt_phys_t *ddp = dde.dde_phys;

		if (ddb.ddb_class == DDT_CLASS_UNIQUE)
			return;

		ASSERT(ddt_phys_total_refcnt(&dde) > 1);

		for (p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
			if (ddp->ddp_phys_birth == 0)
				continue;
			ddt_bp_create(ddb.ddb_checksum,
			    &dde.dde_key, ddp, &blk);
			if (p == DDT_PHYS_DITTO) {
				zdb_count_block(zcb, NULL, &blk, ZDB_OT_DITTO);
			} else {
				zcb->zcb_dedup_asize +=
				    BP_GET_ASIZE(&blk) * (ddp->ddp_refcnt - 1);
				zcb->zcb_dedup_blocks++;
			}
		}
		if (!dump_opt['L']) {
			ddt_t *ddt = spa->spa_ddt[ddb.ddb_checksum];
			ddt_enter(ddt);
			VERIFY(ddt_lookup(ddt, &blk, B_TRUE) != NULL);
			ddt_exit(ddt);
		}
	}

	ASSERT(error == ENOENT);
}
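
The accumulation into zcb_dedup_asize above multiplies a block's allocated size by its reference count minus one, so only the references beyond the first contribute. A small worked sketch of that arithmetic follows; `extra_ref_asize` is a hypothetical helper introduced only for illustration and is not part of zdb.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: allocated size attributed to the references
 * beyond the first, for a block that is stored once but referenced
 * refcnt times.
 */
static uint64_t
extra_ref_asize(uint64_t asize, uint64_t refcnt)
{
	assert(refcnt >= 1);	/* refcnt of 0 would underflow below */
	return (asize * (refcnt - 1));
}
```

For a 4096-byte block referenced three times this yields 8192, and for a block referenced once it yields 0.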

static void
zdb_leak_init(spa_t *spa, zdb_cb_t *zcb)
{
	zcb->zcb_spa = spa;
	uint64_t c;

	if (!dump_opt['L']) {
		dsl_pool_t *dp = spa->spa_dsl_pool;
		vdev_t *rvd = spa->spa_root_vdev;

		/*
		 * We are going to be changing the meaning of the metaslab's
		 * ms_tree. Ensure that the allocator doesn't try to
		 * use the tree.
		 */
		spa->spa_normal_class->mc_ops = &zdb_metaslab_ops;
		spa->spa_log_class->mc_ops = &zdb_metaslab_ops;

2016-09-22 19:30:13 +03:00
|
|
|
zcb->zcb_vd_obsolete_counts =
|
|
|
|
umem_zalloc(rvd->vdev_children * sizeof (uint32_t *),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
|
2010-08-26 20:52:39 +04:00
|
|
|
for (c = 0; c < rvd->vdev_children; c++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_t *vd = rvd->vdev_child[c];
|
2016-09-22 19:30:13 +03:00
|
|
|
uint64_t vim_idx = 0;
|
|
|
|
|
|
|
|
ASSERT3U(c, ==, vd->vdev_id);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: we don't check for mapping leaks on
|
|
|
|
* removing vdevs because their ms_tree's are
|
|
|
|
* used to look for leaks in allocated space.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
zcb->zcb_vd_obsolete_counts[c] =
|
|
|
|
zdb_load_obsolete_counts(vd);
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to 4K, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked, because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
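The selection idea described in the commit message above (prefer a space map that is known to hold a large enough contiguous segment, rather than the one with the most total free space) can be sketched as follows. This is only an illustration under stated assumptions: `sketch_space_map_t`, `sketch_can_satisfy()` and `sketch_pick_map()` are hypothetical names, and the power-of-two histogram layout is assumed for the example. It is not the on-disk format or the actual metaslab weighting code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_HIST_BUCKETS	32

/* Hypothetical, simplified summary of one space map. */
typedef struct {
	uint64_t free_bytes;	/* total free space in the map */
	/* hist[i] = number of free segments with size in [2^i, 2^(i+1)) */
	uint64_t hist[SKETCH_HIST_BUCKETS];
} sketch_space_map_t;

/*
 * Return nonzero if the histogram guarantees at least one free segment
 * of at least `size` bytes. Only buckets whose lower bound is >= size
 * are trusted, so the answer is conservative.
 */
static int
sketch_can_satisfy(const sketch_space_map_t *sm, uint64_t size)
{
	int first = 0;

	assert(size > 0);
	while (first < SKETCH_HIST_BUCKETS && (1ULL << first) < size)
		first++;
	for (int i = first; i < SKETCH_HIST_BUCKETS; i++) {
		if (sm->hist[i] != 0)
			return (1);
	}
	return (0);
}

/*
 * Pick the map with the most free space among those that can satisfy
 * the allocation. Returns the index of the chosen map, or -1 if none
 * qualifies.
 */
static int
sketch_pick_map(const sketch_space_map_t *maps, size_t n, uint64_t size)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!sketch_can_satisfy(&maps[i], size))
			continue;
		if (best < 0 || maps[i].free_bytes > maps[best].free_bytes)
			best = (int)i;
	}
	return (best);
}
```

With a free-space-only heuristic, a heavily fragmented map with the most free bytes would always win; here it is skipped whenever its histogram shows no segment large enough for the request.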
2013-10-02 01:25:53 +04:00
|
|
|
|
|
|
|
/*
|
2016-09-22 19:30:13 +03:00
|
|
|
* Normally, indirect vdevs don't have any
|
|
|
|
* metaslabs. We want to set them up for
|
|
|
|
* zio_claim().
|
2013-10-02 01:25:53 +04:00
|
|
|
*/
|
2016-09-22 19:30:13 +03:00
|
|
|
VERIFY0(vdev_metaslab_init(vd, 0));
|
|
|
|
}
|
|
|
|
|
|
|
|
for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
|
|
|
|
zdb_leak_init_ms(vd->vdev_ms[m], &vim_idx);
|
|
|
|
}
|
|
|
|
if (vd->vdev_ops == &vdev_indirect_ops) {
|
|
|
|
ASSERT3U(vim_idx, ==,
|
|
|
|
vdev_indirect_mapping_num_entries(
|
|
|
|
vd->vdev_indirect_mapping));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
2014-09-17 00:24:48 +04:00
|
|
|
(void) fprintf(stderr, "\n");
|
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
if (bpobj_is_open(&dp->dp_obsolete_bpobj)) {
|
|
|
|
ASSERT(spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_DEVICE_REMOVAL));
|
|
|
|
(void) bpobj_iterate_nofree(&dp->dp_obsolete_bpobj,
|
|
|
|
increment_indirect_mapping_cb, zcb, NULL);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);
|
|
|
|
|
|
|
|
zdb_ddt_leak_init(spa, zcb);
|
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_CONFIG, FTAG);
|
|
|
|
}
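The per-vdev obsolete count table set up above is consumed by zdb_check_for_obsolete_leaks(), which follows. Its per-entry arithmetic can be modeled in isolation as shown below. This is a minimal sketch under assumptions: `sketch_entry_t`, `sketch_unit_present()` and `sketch_entry_leak()` are hypothetical names, and a flat byte array stands in for the range tree that the real code queries with range_tree_contains().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one indirect mapping entry. */
typedef struct {
	uint64_t src_offset;	/* offset of the segment on the removed vdev */
	uint64_t asize;		/* allocated size of the mapped segment */
} sketch_entry_t;

/*
 * Hypothetical stand-in for range_tree_contains(): one flag per
 * ashift-sized unit, nonzero when that unit is still present in the
 * tracking structure.
 */
static int
sketch_unit_present(const uint8_t *units, uint64_t offset, int ashift)
{
	return (units[offset >> ashift] != 0);
}

/*
 * Walk the entry in ashift-sized steps, total the bytes still present,
 * and subtract the obsolete count recorded for the entry. A nonzero
 * result is what the real function reports as a count mismatch.
 */
static int64_t
sketch_entry_leak(const sketch_entry_t *e, const uint8_t *units,
    int ashift, uint32_t recorded_obsolete)
{
	assert(ashift >= 0 && ashift < 64);

	uint64_t unit = 1ULL << ashift;
	uint64_t obsolete_bytes = 0;

	for (uint64_t inner = 0; inner < e->asize; inner += unit) {
		if (sketch_unit_present(units, e->src_offset + inner, ashift))
			obsolete_bytes += unit;
	}
	return ((int64_t)obsolete_bytes - (int64_t)recorded_obsolete);
}
```

The sketch widens the step to 64 bits (`1ULL << ashift`) purely as a choice for the example; it also returns the difference to the caller instead of printing it.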
|
|
|
|
|
static boolean_t
zdb_check_for_obsolete_leaks(vdev_t *vd, zdb_cb_t *zcb)
{
	boolean_t leaks = B_FALSE;
	vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
	uint64_t total_leaked = 0;

	ASSERT(vim != NULL);

	for (uint64_t i = 0; i < vdev_indirect_mapping_num_entries(vim); i++) {
		vdev_indirect_mapping_entry_phys_t *vimep =
		    &vim->vim_entries[i];
		uint64_t obsolete_bytes = 0;
		uint64_t offset = DVA_MAPPING_GET_SRC_OFFSET(vimep);
		metaslab_t *msp = vd->vdev_ms[offset >> vd->vdev_ms_shift];

		/*
		 * This is not very efficient but it's easy to
		 * verify correctness.
		 */
		for (uint64_t inner_offset = 0;
		    inner_offset < DVA_GET_ASIZE(&vimep->vimep_dst);
		    inner_offset += 1 << vd->vdev_ashift) {
			if (range_tree_contains(msp->ms_tree,
			    offset + inner_offset, 1 << vd->vdev_ashift)) {
				obsolete_bytes += 1 << vd->vdev_ashift;
			}
		}

		int64_t bytes_leaked = obsolete_bytes -
		    zcb->zcb_vd_obsolete_counts[vd->vdev_id][i];
		ASSERT3U(DVA_GET_ASIZE(&vimep->vimep_dst), >=,
		    zcb->zcb_vd_obsolete_counts[vd->vdev_id][i]);
		if (bytes_leaked != 0 &&
		    (vdev_obsolete_counts_are_precise(vd) ||
		    dump_opt['d'] >= 5)) {
			(void) printf("obsolete indirect mapping count "
			    "mismatch on %llu:%llx:%llx : %llx bytes leaked\n",
			    (u_longlong_t)vd->vdev_id,
			    (u_longlong_t)DVA_MAPPING_GET_SRC_OFFSET(vimep),
			    (u_longlong_t)DVA_GET_ASIZE(&vimep->vimep_dst),
			    (u_longlong_t)bytes_leaked);
		}
		total_leaked += ABS(bytes_leaked);
	}

	if (!vdev_obsolete_counts_are_precise(vd) && total_leaked > 0) {
		int pct_leaked = total_leaked * 100 /
		    vdev_indirect_mapping_bytes_mapped(vim);
		(void) printf("cannot verify obsolete indirect mapping "
		    "counts of vdev %llu because precise feature was not "
		    "enabled when it was removed: %d%% (%llx bytes) of "
		    "mapping unreferenced\n",
		    (u_longlong_t)vd->vdev_id, pct_leaked,
		    (u_longlong_t)total_leaked);
	} else if (total_leaked > 0) {
		(void) printf("obsolete indirect mapping count mismatch "
		    "for vdev %llu -- %llx total bytes mismatched\n",
		    (u_longlong_t)vd->vdev_id,
		    (u_longlong_t)total_leaked);
		leaks |= B_TRUE;
	}

	vdev_indirect_mapping_free_obsolete_counts(vim,
	    zcb->zcb_vd_obsolete_counts[vd->vdev_id]);
	zcb->zcb_vd_obsolete_counts[vd->vdev_id] = NULL;

	return (leaks);
}

static boolean_t
zdb_leak_fini(spa_t *spa, zdb_cb_t *zcb)
{
	boolean_t leaks = B_FALSE;
	if (!dump_opt['L']) {
		vdev_t *rvd = spa->spa_root_vdev;
		for (unsigned c = 0; c < rvd->vdev_children; c++) {
			vdev_t *vd = rvd->vdev_child[c];
			ASSERTV(metaslab_group_t *mg = vd->vdev_mg);

			if (zcb->zcb_vd_obsolete_counts[c] != NULL) {
				leaks |= zdb_check_for_obsolete_leaks(vd, zcb);
			}

			for (uint64_t m = 0; m < vd->vdev_ms_count; m++) {
				metaslab_t *msp = vd->vdev_ms[m];
				ASSERT3P(mg, ==, msp->ms_group);
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488

				/*
				 * The ms_tree has been overloaded to
				 * contain allocated segments. Now that we
				 * finished traversing all blocks, any
				 * block that remains in the ms_tree
				 * represents an allocated block that we
				 * did not claim during the traversal.
				 * Claimed blocks would have been removed
				 * from the ms_tree. For indirect vdevs,
				 * space remaining in the tree represents
				 * parts of the mapping that are not
				 * referenced, which is not a bug.
				 */
				if (vd->vdev_ops == &vdev_indirect_ops) {
					range_tree_vacate(msp->ms_tree,
					    NULL, NULL);
				} else {
					range_tree_vacate(msp->ms_tree,
					    zdb_leak, vd);
				}

				if (msp->ms_loaded)
					msp->ms_loaded = B_FALSE;
			}
		}
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility: the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL, at which point they will be re-enabled.
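The first porting note above (avoiding a zero-sized allocation when compacting a child array) can be sketched in userland C as follows. This is an illustration under stated assumptions, not the code of vdev_compact_children(): node_t and compact_children() are hypothetical names, and malloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vdev with an array of child pointers. */
typedef struct node {
	struct node **children;	/* may contain NULL holes */
	size_t nchildren;	/* number of slots in `children` */
} node_t;

/*
 * Squeeze NULL holes out of the child array. When no children remain,
 * free the array and leave `children` NULL with a zero count, instead
 * of asking the allocator for a zero-sized buffer.
 */
static void
compact_children(node_t *n)
{
	size_t oldc = n->nchildren;
	size_t newc = 0;
	node_t **newarr;

	for (size_t i = 0; i < oldc; i++) {
		if (n->children[i] != NULL)
			newc++;
	}
	if (newc == oldc)
		return;			/* nothing to compact */

	if (newc == 0) {
		/* Empty: never issue a zero-sized allocation. */
		free(n->children);
		n->children = NULL;
		n->nchildren = 0;
		return;
	}

	newarr = malloc(newc * sizeof (*newarr));
	assert(newarr != NULL);		/* sketch only: no error recovery */

	newc = 0;
	for (size_t i = 0; i < oldc; i++) {
		if (n->children[i] != NULL)
			newarr[newc++] = n->children[i];
	}
	free(n->children);
	n->children = newarr;
	n->nchildren = newc;
}
```

Handling the empty case explicitly means callers can test the pointer against NULL regardless of what the platform allocator returns for a zero-byte request.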
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
umem_free(zcb->zcb_vd_obsolete_counts,
|
|
|
|
rvd->vdev_children * sizeof (uint32_t *));
|
|
|
|
zcb->zcb_vd_obsolete_counts = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2016-09-22 19:30:13 +03:00
|
|
|
return (leaks);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
count_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
zdb_cb_t *zcb = arg;
|
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
if (dump_opt['b'] >= 5) {
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("[%s] %s\n",
|
|
|
|
"deferred free", blkbuf);
|
|
|
|
}
|
|
|
|
zdb_count_block(zcb, NULL, bp, ZDB_OT_DEFERRED);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
|
|
|
dump_block_stats(spa_t *spa)
|
|
|
|
{
|
2010-08-26 20:52:41 +04:00
|
|
|
zdb_cb_t zcb;
|
2008-11-20 23:01:55 +03:00
|
|
|
zdb_blkstats_t *zb, *tzb;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t norm_alloc, norm_space, total_alloc, total_found;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int flags = TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA |
|
|
|
|
TRAVERSE_NO_DECRYPT | TRAVERSE_HARD;
|
2014-06-06 01:19:08 +04:00
|
|
|
boolean_t leaks = B_FALSE;
|
2018-02-02 02:42:41 +03:00
|
|
|
int e, c, err;
|
2014-06-06 01:19:08 +04:00
|
|
|
bp_embedded_type_t i;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
bzero(&zcb, sizeof (zcb));
|
2013-03-25 01:24:51 +04:00
|
|
|
(void) printf("\nTraversing all blocks %s%s%s%s%s...\n\n",
|
2010-05-29 00:45:14 +04:00
|
|
|
(dump_opt['c'] || !dump_opt['L']) ? "to verify " : "",
|
|
|
|
(dump_opt['c'] == 1) ? "metadata " : "",
|
|
|
|
dump_opt['c'] ? "checksums " : "",
|
|
|
|
(dump_opt['c'] && !dump_opt['L']) ? "and verify " : "",
|
|
|
|
!dump_opt['L'] ? "nothing leaked " : "");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Load all space maps as SM_ALLOC maps, then traverse the pool
|
|
|
|
* claiming each block we discover. If the pool is perfectly
|
|
|
|
* consistent, the space maps will be empty when we're done.
|
|
|
|
* Anything left over is a leak; any block we can't claim (because
|
|
|
|
* it's not part of any space map) is a double allocation,
|
|
|
|
* reference to a freed block, or an unclaimed log block.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2013-11-01 23:26:11 +04:00
|
|
|
bzero(&zcb, sizeof (zdb_cb_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_leak_init(spa, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If there's a deferred-free bplist, process that first.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) bpobj_iterate_nofree(&spa->spa_deferred_bpobj,
|
|
|
|
count_block_cb, &zcb, NULL);
|
2016-09-22 19:30:13 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
|
|
|
|
(void) bpobj_iterate_nofree(&spa->spa_dsl_pool->dp_free_bpobj,
|
|
|
|
count_block_cb, &zcb, NULL);
|
|
|
|
}
|
2016-09-22 19:30:13 +03:00
|
|
|
|
|
|
|
zdb_claim_removing(spa, &zcb);
|
|
|
|
|
2013-10-08 21:13:05 +04:00
|
|
|
if (spa_feature_is_active(spa, SPA_FEATURE_ASYNC_DESTROY)) {
|
2012-12-14 03:24:15 +04:00
|
|
|
VERIFY3U(0, ==, bptree_iterate(spa->spa_meta_objset,
|
|
|
|
spa->spa_dsl_pool->dp_bptree_obj, B_FALSE, count_block_cb,
|
|
|
|
&zcb, NULL));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['c'] > 1)
|
|
|
|
flags |= TRAVERSE_PREFETCH_DATA;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-03-25 01:24:51 +04:00
|
|
|
zcb.zcb_totalasize = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
zcb.zcb_start = zcb.zcb_lastprint = gethrtime();
|
2018-02-02 02:42:41 +03:00
|
|
|
err = traverse_pool(spa, 0, flags, zdb_blkptr_cb, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-03 03:36:32 +04:00
|
|
|
/*
|
|
|
|
* If we've traversed the data blocks then we need to wait for those
|
|
|
|
* I/Os to complete. We leverage "The Godfather" zio to wait on
|
|
|
|
* all async I/Os to complete.
|
|
|
|
*/
|
|
|
|
if (dump_opt['c']) {
|
2014-09-17 10:59:43 +04:00
|
|
|
for (c = 0; c < max_ncpus; c++) {
|
|
|
|
(void) zio_wait(spa->spa_async_zio_root[c]);
|
|
|
|
spa->spa_async_zio_root[c] = zio_root(spa, NULL, NULL,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE |
|
|
|
|
ZIO_FLAG_GODFATHER);
|
|
|
|
}
|
2013-05-03 03:36:32 +04:00
|
|
|
}
|
|
|
|
|
2018-02-02 02:42:41 +03:00
|
|
|
/*
|
|
|
|
* Done after zio_wait() since zcb_haderrors is modified in
|
|
|
|
* zdb_blkptr_done()
|
|
|
|
*/
|
|
|
|
zcb.zcb_haderrors |= err;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zcb.zcb_haderrors) {
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) printf("\nError counts:\n\n");
|
|
|
|
(void) printf("\t%5s %s\n", "errno", "count");
|
2010-08-26 20:52:39 +04:00
|
|
|
for (e = 0; e < 256; e++) {
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zcb.zcb_errors[e] != 0) {
|
|
|
|
(void) printf("\t%5d %llu\n",
|
|
|
|
e, (u_longlong_t)zcb.zcb_errors[e]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Report any leaked segments.
|
|
|
|
*/
|
2016-09-22 19:30:13 +03:00
|
|
|
leaks |= zdb_leak_fini(spa, &zcb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
tzb = &zcb.zcb_type[ZB_TOTAL][ZDB_OT_TOTAL];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
norm_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
|
|
|
|
norm_space = metaslab_class_get_space(spa_normal_class(spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
total_alloc = norm_alloc + metaslab_class_get_alloc(spa_log_class(spa));
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
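
The remapping this commit message describes, where an offset on a removed (now "indirect") vdev is translated to its new concrete location through an in-memory mapping table, can be pictured with a small lookup sketch. This is an illustration only: `mapping_entry_t`, `remap_offset()` and the field names are hypothetical and are not the `vdev_indirect_mapping` interface. The sketch assumes the table is sorted by source offset and that segments do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: [me_src_offset, me_src_offset + me_size) was copied. */
typedef struct mapping_entry {
	uint64_t me_src_offset;	/* offset on the removed vdev */
	uint64_t me_size;	/* length of the copied segment */
	uint64_t me_dst_vdev;	/* concrete vdev now holding the data */
	uint64_t me_dst_offset;	/* offset on that vdev */
} mapping_entry_t;

/*
 * Binary-search a table sorted by me_src_offset for the entry covering
 * "offset". Returns 0 and fills in the new location on success, or -1
 * if no entry covers the offset.
 */
static int
remap_offset(const mapping_entry_t *table, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const mapping_entry_t *me = &table[mid];

		if (offset < me->me_src_offset) {
			hi = mid;
		} else if (offset - me->me_src_offset >= me->me_size) {
			lo = mid + 1;
		} else {
			*dst_vdev = me->me_dst_vdev;
			*dst_offset = me->me_dst_offset +
			    (offset - me->me_src_offset);
			return (0);
		}
	}
	return (-1);
}
```

Keeping the table sorted makes each lookup logarithmic in the number of segments, which is one plausible reading of why the message calls the overhead minimal once the table is in memory.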
	total_found = tzb->zb_asize - zcb.zcb_dedup_asize +
	    zcb.zcb_removing_size;

	if (total_found == total_alloc) {
		if (!dump_opt['L'])
			(void) printf("\n\tNo leaks (block sum matches space"
			    " maps exactly)\n");
	} else {
		(void) printf("block traversal size %llu != alloc %llu "
		    "(%s %lld)\n",
		    (u_longlong_t)total_found,
		    (u_longlong_t)total_alloc,
		    (dump_opt['L']) ? "unreachable" : "leaked",
		    (longlong_t)(total_alloc - total_found));
		leaks = B_TRUE;
	}

	if (tzb->zb_count == 0)
		return (2);

	(void) printf("\n");
	(void) printf("\tbp count: %10llu\n",
	    (u_longlong_t)tzb->zb_count);
	(void) printf("\tganged count: %10llu\n",
	    (longlong_t)tzb->zb_gangs);
	(void) printf("\tbp logical: %10llu avg: %6llu\n",
	    (u_longlong_t)tzb->zb_lsize,
	    (u_longlong_t)(tzb->zb_lsize / tzb->zb_count));
	(void) printf("\tbp physical: %10llu avg:"
	    " %6llu compression: %6.2f\n",
	    (u_longlong_t)tzb->zb_psize,
	    (u_longlong_t)(tzb->zb_psize / tzb->zb_count),
	    (double)tzb->zb_lsize / tzb->zb_psize);
	(void) printf("\tbp allocated: %10llu avg:"
	    " %6llu compression: %6.2f\n",
	    (u_longlong_t)tzb->zb_asize,
	    (u_longlong_t)(tzb->zb_asize / tzb->zb_count),
	    (double)tzb->zb_lsize / tzb->zb_asize);
	(void) printf("\tbp deduped: %10llu ref>1:"
	    " %6llu deduplication: %6.2f\n",
	    (u_longlong_t)zcb.zcb_dedup_asize,
	    (u_longlong_t)zcb.zcb_dedup_blocks,
	    (double)zcb.zcb_dedup_asize / tzb->zb_asize + 1.0);
	(void) printf("\tSPA allocated: %10llu used: %5.2f%%\n",
	    (u_longlong_t)norm_alloc, 100.0 * norm_alloc / norm_space);

	for (i = 0; i < NUM_BP_EMBEDDED_TYPES; i++) {
		if (zcb.zcb_embedded_blocks[i] == 0)
			continue;
		(void) printf("\n");
		(void) printf("\tadditional, non-pointer bps of type %u: "
		    "%10llu\n",
		    i, (u_longlong_t)zcb.zcb_embedded_blocks[i]);

		if (dump_opt['b'] >= 3) {
			(void) printf("\t number of (compressed) bytes: "
			    "number of bps\n");
			dump_histogram(zcb.zcb_embedded_histogram[i],
			    sizeof (zcb.zcb_embedded_histogram[i]) /
			    sizeof (zcb.zcb_embedded_histogram[i][0]), 0);
		}
	}

	if (tzb->zb_ditto_samevdev != 0) {
		(void) printf("\tDittoed blocks on same vdev: %llu\n",
		    (longlong_t)tzb->zb_ditto_samevdev);
	}

OpenZFS 7614, 9064 - zfs device evacuation/removal
	for (uint64_t v = 0; v < spa->spa_root_vdev->vdev_children; v++) {
		vdev_t *vd = spa->spa_root_vdev->vdev_child[v];
		vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;

		if (vim == NULL) {
			continue;
		}

		char mem[32];
		/* Format the mapping's in-memory size into mem. */
		zdb_nicenum(vdev_indirect_mapping_size(vim),
		    mem, sizeof (mem));

		(void) printf("\tindirect vdev id %llu has %llu segments "
		    "(%s in memory)\n",
		    (longlong_t)vd->vdev_id,
		    (longlong_t)vdev_indirect_mapping_num_entries(vim), mem);
	}

	if (dump_opt['b'] >= 2) {
		int l, t, level;
		(void) printf("\nBlocks\tLSIZE\tPSIZE\tASIZE"
		    "\t avg\t comp\t%%Total\tType\n");

		for (t = 0; t <= ZDB_OT_TOTAL; t++) {
			char csize[32], lsize[32], psize[32], asize[32];
			char avg[32], gang[32];
			const char *typename;

			/* make sure nicenum has enough space */
			CTASSERT(sizeof (csize) >= NN_NUMBUF_SZ);
			CTASSERT(sizeof (lsize) >= NN_NUMBUF_SZ);
			CTASSERT(sizeof (psize) >= NN_NUMBUF_SZ);
			CTASSERT(sizeof (asize) >= NN_NUMBUF_SZ);
			CTASSERT(sizeof (avg) >= NN_NUMBUF_SZ);
			CTASSERT(sizeof (gang) >= NN_NUMBUF_SZ);

			if (t < DMU_OT_NUMTYPES)
				typename = dmu_ot[t].ot_name;
			else
				typename = zdb_ot_extname[t - DMU_OT_NUMTYPES];

			if (zcb.zcb_type[ZB_TOTAL][t].zb_asize == 0) {
				(void) printf("%6s\t%5s\t%5s\t%5s"
				    "\t%5s\t%5s\t%6s\t%s\n",
				    "-",
				    "-",
				    "-",
				    "-",
				    "-",
				    "-",
				    "-",
				    typename);
				continue;
			}

			for (l = ZB_TOTAL - 1; l >= -1; l--) {
				level = (l == -1 ? ZB_TOTAL : l);
				zb = &zcb.zcb_type[level][t];

				if (zb->zb_asize == 0)
					continue;

				if (dump_opt['b'] < 3 && level != ZB_TOTAL)
					continue;

				if (level == 0 && zb->zb_asize ==
				    zcb.zcb_type[ZB_TOTAL][t].zb_asize)
					continue;

				zdb_nicenum(zb->zb_count, csize,
				    sizeof (csize));
				zdb_nicenum(zb->zb_lsize, lsize,
				    sizeof (lsize));
				zdb_nicenum(zb->zb_psize, psize,
				    sizeof (psize));
				zdb_nicenum(zb->zb_asize, asize,
				    sizeof (asize));
				zdb_nicenum(zb->zb_asize / zb->zb_count, avg,
				    sizeof (avg));
				zdb_nicenum(zb->zb_gangs, gang, sizeof (gang));

				(void) printf("%6s\t%5s\t%5s\t%5s\t%5s"
				    "\t%5.2f\t%6.2f\t",
				    csize, lsize, psize, asize, avg,
				    (double)zb->zb_lsize / zb->zb_psize,
				    100.0 * zb->zb_asize / tzb->zb_asize);

				if (level == ZB_TOTAL)
					(void) printf("%s\n", typename);
				else
					(void) printf(" L%d %s\n",
					    level, typename);

				if (dump_opt['b'] >= 3 && zb->zb_gangs > 0) {
					(void) printf("\t number of ganged "
					    "blocks: %s\n", gang);
				}

				if (dump_opt['b'] >= 4) {
					(void) printf("psize "
					    "(in 512-byte sectors): "
					    "number of blocks\n");
					dump_histogram(zb->zb_psize_histogram,
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to 4K, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
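
The motivation in this commit message, that total free space says nothing about whether a space map can satisfy a contiguous allocation, can be sketched with a power-of-two histogram of free segment sizes. This is a minimal illustration under stated assumptions, not the on-disk `spacemap_histogram` format: `HISTO_BUCKETS`, `high_bit()`, `histo_add_segment()` and `histo_can_satisfy()` are hypothetical names, and bucket `b` is assumed to count free segments whose size lies in `[2^b, 2^(b+1))`.

```c
#include <stdint.h>

#define	HISTO_BUCKETS	64	/* hypothetical: one bucket per bit */

/* 0-based index of the highest set bit; returns -1 when v is 0. */
static int
high_bit(uint64_t v)
{
	int i = -1;

	while (v != 0) {
		v >>= 1;
		i++;
	}
	return (i);
}

/* Record one free segment of "size" bytes (size must be nonzero). */
static void
histo_add_segment(uint64_t histo[HISTO_BUCKETS], uint64_t size)
{
	histo[high_bit(size)]++;
}

/*
 * Return 1 if the histogram guarantees that some free segment is at
 * least "size" bytes long, 0 otherwise. The check is conservative: a
 * request that is not a power of two is only certain to fit in a
 * segment from the next bucket up.
 */
static int
histo_can_satisfy(const uint64_t histo[HISTO_BUCKETS], uint64_t size)
{
	int b;

	if (size == 0)
		return (1);

	b = high_bit(size);
	if ((size & (size - 1)) != 0)
		b++;

	for (; b < HISTO_BUCKETS; b++) {
		if (histo[b] != 0)
			return (1);
	}
	return (0);
}
```

With such a summary, a map holding 4 MiB of free space in 4 KiB pieces can be rejected for a 128 KiB allocation without loading it, while a map with a single 1 MiB segment is accepted.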
					    PSIZE_HISTO_SIZE, 0);
				}
			}
		}
	}

	(void) printf("\n");

	if (leaks)
		return (2);

	if (zcb.zcb_haderrors)
		return (3);

	return (0);
}

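
The leak check at the start of this part of the function compares the space reached by block traversal with the space the space maps report as allocated. The sketch below restates that arithmetic in isolation. `leak_status_t` and `check_leaks()` are hypothetical names introduced for illustration; only the formula (`asize - dedup_asize + removing_size` compared with the allocated total) is taken from the code above.

```c
#include <stdint.h>

typedef enum leak_status {
	LEAK_NONE,		/* traversal total matches the space maps */
	LEAK_UNACCOUNTED,	/* allocated space the traversal never reached */
	LEAK_OVERCOUNTED	/* traversal found more than is allocated */
} leak_status_t;

/*
 * Compare the traversed total with the allocated total. "delta" receives
 * total_alloc - total_found as a signed value, mirroring the signed
 * difference printed in the mismatch message.
 */
static leak_status_t
check_leaks(uint64_t asize, uint64_t dedup_asize, uint64_t removing_size,
    uint64_t total_alloc, int64_t *delta)
{
	uint64_t total_found = asize - dedup_asize + removing_size;

	*delta = (int64_t)(total_alloc - total_found);
	if (total_found == total_alloc)
		return (LEAK_NONE);
	return (*delta > 0 ? LEAK_UNACCOUNTED : LEAK_OVERCOUNTED);
}
```

A positive delta corresponds to the "leaked" (or, with `-L`, "unreachable") case in the message above; a negative delta means the traversal counted more than the space maps account for.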
typedef struct zdb_ddt_entry {
	ddt_key_t	zdde_key;
	uint64_t	zdde_ref_blocks;
	uint64_t	zdde_ref_lsize;
	uint64_t	zdde_ref_psize;
	uint64_t	zdde_ref_dsize;
	avl_node_t	zdde_node;
} zdb_ddt_entry_t;

/* ARGSUSED */
static int
zdb_ddt_add_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
    const zbookmark_phys_t *zb, const dnode_phys_t *dnp, void *arg)
{
	avl_tree_t *t = arg;
	avl_index_t where;
	zdb_ddt_entry_t *zdde, zdde_search;

	if (bp == NULL || BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp))
		return (0);

	if (dump_opt['S'] > 1 && zb->zb_level == ZB_ROOT_LEVEL) {
		(void) printf("traversing objset %llu, %llu objects, "
		    "%lu blocks so far\n",
		    (u_longlong_t)zb->zb_objset,
		    (u_longlong_t)BP_GET_FILL(bp),
		    avl_numnodes(t));
	}

	if (BP_IS_HOLE(bp) || BP_GET_CHECKSUM(bp) == ZIO_CHECKSUM_OFF ||
	    BP_GET_LEVEL(bp) > 0 || DMU_OT_IS_METADATA(BP_GET_TYPE(bp)))
		return (0);

	ddt_key_fill(&zdde_search.zdde_key, bp);

	zdde = avl_find(t, &zdde_search, &where);

	if (zdde == NULL) {
		zdde = umem_zalloc(sizeof (*zdde), UMEM_NOFAIL);
		zdde->zdde_key = zdde_search.zdde_key;
		avl_insert(t, zdde, where);
	}

	zdde->zdde_ref_blocks += 1;
	zdde->zdde_ref_lsize += BP_GET_LSIZE(bp);
	zdde->zdde_ref_psize += BP_GET_PSIZE(bp);
	zdde->zdde_ref_dsize += bp_get_dsize_sync(spa, bp);

	return (0);
}

static void
dump_simulated_ddt(spa_t *spa)
{
	avl_tree_t t;
	void *cookie = NULL;
	zdb_ddt_entry_t *zdde;
	ddt_histogram_t ddh_total;
	ddt_stat_t dds_total;

	bzero(&ddh_total, sizeof (ddh_total));
	bzero(&dds_total, sizeof (dds_total));
	avl_create(&t, ddt_entry_compare,
	    sizeof (zdb_ddt_entry_t), offsetof(zdb_ddt_entry_t, zdde_node));

	spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);

Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
	(void) traverse_pool(spa, 0, TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA |
	    TRAVERSE_NO_DECRYPT, zdb_ddt_add_cb, &t);

	spa_config_exit(spa, SCL_CONFIG, FTAG);

	while ((zdde = avl_destroy_nodes(&t, &cookie)) != NULL) {
		ddt_stat_t dds;
		uint64_t refcnt = zdde->zdde_ref_blocks;
		ASSERT(refcnt != 0);

		dds.dds_blocks = zdde->zdde_ref_blocks / refcnt;
		dds.dds_lsize = zdde->zdde_ref_lsize / refcnt;
		dds.dds_psize = zdde->zdde_ref_psize / refcnt;
		dds.dds_dsize = zdde->zdde_ref_dsize / refcnt;

		dds.dds_ref_blocks = zdde->zdde_ref_blocks;
		dds.dds_ref_lsize = zdde->zdde_ref_lsize;
		dds.dds_ref_psize = zdde->zdde_ref_psize;
		dds.dds_ref_dsize = zdde->zdde_ref_dsize;

		ddt_stat_add(&ddh_total.ddh_stat[highbit64(refcnt) - 1],
		    &dds, 0);

		umem_free(zdde, sizeof (*zdde));
	}

	avl_destroy(&t);

	ddt_histogram_stat(&dds_total, &ddh_total);

	(void) printf("Simulated DDT histogram:\n");

	zpool_dump_ddt(&dds_total, &ddh_total);

	dump_dedup_ratio(&dds_total);
}

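
The loop in `dump_simulated_ddt()` files each entry into the histogram slot `highbit64(refcnt) - 1` and divides the referenced totals by the reference count to get the size of a single stored copy. The stand-alone sketch below shows that bucketing. `highbit64_sketch()`, `ddt_bucket()` and `per_copy_size()` are hypothetical helper names; the first is assumed to behave like `highbit64()`, returning the 1-based index of the highest set bit and 0 for an input of 0.

```c
#include <stdint.h>

/* 1-based index of the highest set bit; returns 0 when v is 0. */
static int
highbit64_sketch(uint64_t v)
{
	int h = 0;

	while (v != 0) {
		v >>= 1;
		h++;
	}
	return (h);
}

/*
 * Histogram slot for a block referenced "refcnt" times (refcnt != 0):
 * slot 0 holds refcnt 1, slot 1 holds 2-3, slot 2 holds 4-7, and so on.
 */
static int
ddt_bucket(uint64_t refcnt)
{
	return (highbit64_sketch(refcnt) - 1);
}

/* Size of one stored copy, given the total summed over all references. */
static uint64_t
per_copy_size(uint64_t ref_total, uint64_t refcnt)
{
	return (ref_total / refcnt);
}
```

Grouping by powers of two keeps the histogram small, at most 64 slots, regardless of how many distinct reference counts occur in the pool.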
OpenZFS 7614, 9064 - zfs device evacuation/removal
static int
verify_device_removal_feature_counts(spa_t *spa)
{
	uint64_t dr_feature_refcount = 0;
	uint64_t oc_feature_refcount = 0;
	uint64_t indirect_vdev_count = 0;
	uint64_t precise_vdev_count = 0;
	uint64_t obsolete_counts_object_count = 0;
	uint64_t obsolete_sm_count = 0;
	uint64_t obsolete_counts_count = 0;
	uint64_t scip_count = 0;
	uint64_t obsolete_bpobj_count = 0;
	int ret = 0;

	spa_condensing_indirect_phys_t *scip =
	    &spa->spa_condensing_indirect_phys;
	if (scip->scip_next_mapping_object != 0) {
		vdev_t *vd = spa->spa_root_vdev->vdev_child[scip->scip_vdev];
		ASSERT(scip->scip_prev_obsolete_sm_object != 0);
		ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);

		(void) printf("Condensing indirect vdev %llu: new mapping "
		    "object %llu, prev obsolete sm %llu\n",
		    (u_longlong_t)scip->scip_vdev,
		    (u_longlong_t)scip->scip_next_mapping_object,
		    (u_longlong_t)scip->scip_prev_obsolete_sm_object);
		if (scip->scip_prev_obsolete_sm_object != 0) {
			space_map_t *prev_obsolete_sm = NULL;
			VERIFY0(space_map_open(&prev_obsolete_sm,
			    spa->spa_meta_objset,
			    scip->scip_prev_obsolete_sm_object,
			    0, vd->vdev_asize, 0));
			space_map_update(prev_obsolete_sm);
			dump_spacemap(spa->spa_meta_objset, prev_obsolete_sm);
			(void) printf("\n");
			space_map_close(prev_obsolete_sm);
		}

		scip_count += 2;
	}

	for (uint64_t i = 0; i < spa->spa_root_vdev->vdev_children; i++) {
		vdev_t *vd = spa->spa_root_vdev->vdev_child[i];
		vdev_indirect_config_t *vic = &vd->vdev_indirect_config;

		if (vic->vic_mapping_object != 0) {
			ASSERT(vd->vdev_ops == &vdev_indirect_ops ||
			    vd->vdev_removing);
			indirect_vdev_count++;

			if (vd->vdev_indirect_mapping->vim_havecounts) {
				obsolete_counts_count++;
			}
		}
		if (vdev_obsolete_counts_are_precise(vd)) {
			ASSERT(vic->vic_mapping_object != 0);
			precise_vdev_count++;
		}
		if (vdev_obsolete_sm_object(vd) != 0) {
			ASSERT(vic->vic_mapping_object != 0);
			obsolete_sm_count++;
		}
	}

	(void) feature_get_refcount(spa,
	    &spa_feature_table[SPA_FEATURE_DEVICE_REMOVAL],
	    &dr_feature_refcount);
	(void) feature_get_refcount(spa,
	    &spa_feature_table[SPA_FEATURE_OBSOLETE_COUNTS],
	    &oc_feature_refcount);

	if (dr_feature_refcount != indirect_vdev_count) {
		ret = 1;
		(void) printf("Number of indirect vdevs (%llu) " \
		    "does not match feature count (%llu)\n",
		    (u_longlong_t)indirect_vdev_count,
		    (u_longlong_t)dr_feature_refcount);
	} else {
		(void) printf("Verified device_removal feature refcount " \
		    "of %llu is correct\n",
		    (u_longlong_t)dr_feature_refcount);
	}

	if (zap_contains(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_OBSOLETE_BPOBJ) == 0) {
		obsolete_bpobj_count++;
	}


	obsolete_counts_object_count = precise_vdev_count;
	obsolete_counts_object_count += obsolete_sm_count;
	obsolete_counts_object_count += obsolete_counts_count;
	obsolete_counts_object_count += scip_count;
	obsolete_counts_object_count += obsolete_bpobj_count;
	obsolete_counts_object_count += remap_deadlist_count;

	if (oc_feature_refcount != obsolete_counts_object_count) {
		ret = 1;
		(void) printf("Number of obsolete counts objects (%llu) " \
		    "does not match feature count (%llu)\n",
		    (u_longlong_t)obsolete_counts_object_count,
		    (u_longlong_t)oc_feature_refcount);
		(void) printf("pv:%llu os:%llu oc:%llu sc:%llu "
		    "ob:%llu rd:%llu\n",
		    (u_longlong_t)precise_vdev_count,
		    (u_longlong_t)obsolete_sm_count,
		    (u_longlong_t)obsolete_counts_count,
		    (u_longlong_t)scip_count,
		    (u_longlong_t)obsolete_bpobj_count,
		    (u_longlong_t)remap_deadlist_count);
	} else {
		(void) printf("Verified indirect_refcount feature refcount " \
		    "of %llu is correct\n",
		    (u_longlong_t)oc_feature_refcount);
	}
	return (ret);
}

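
The second check in `verify_device_removal_feature_counts()` rebuilds the expected obsolete_counts refcount by summing six independent tallies and comparing the sum with the stored feature refcount. The sketch below isolates that bookkeeping. `oc_counts_t`, `expected_oc_refcount()` and `verify_oc_refcount()` are hypothetical names for illustration; the six terms mirror the counters summed in the function above.

```c
#include <stdint.h>

/* Hypothetical bundle of the tallies summed by the function above. */
typedef struct oc_counts {
	uint64_t oc_precise_vdevs;	/* vdevs with precise obsolete counts */
	uint64_t oc_obsolete_sms;	/* vdevs with an obsolete space map */
	uint64_t oc_havecounts;		/* mappings that carry counts */
	uint64_t oc_scip;		/* 2 while a condense is in progress */
	uint64_t oc_obsolete_bpobj;	/* 1 if the pool-wide bpobj exists */
	uint64_t oc_remap_deadlists;	/* datasets with a remap deadlist */
} oc_counts_t;

/* Sum of every object that should hold a reference on the feature. */
static uint64_t
expected_oc_refcount(const oc_counts_t *c)
{
	return (c->oc_precise_vdevs + c->oc_obsolete_sms +
	    c->oc_havecounts + c->oc_scip + c->oc_obsolete_bpobj +
	    c->oc_remap_deadlists);
}

/* Returns 0 when the stored refcount matches the tally, 1 otherwise. */
static int
verify_oc_refcount(uint64_t feature_refcount, const oc_counts_t *c)
{
	return (feature_refcount == expected_oc_refcount(c) ? 0 : 1);
}
```

Keeping the individual terms alongside the sum is what lets the mismatch message print the per-term breakdown (`pv`, `os`, `oc`, `sc`, `ob`, `rd`), so a discrepancy can be traced to one kind of object.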
static void
dump_zpool(spa_t *spa)
{
	dsl_pool_t *dp = spa_get_dsl(spa);
	int rc = 0;

	if (dump_opt['S']) {
		dump_simulated_ddt(spa);
		return;
	}

	if (!dump_opt['e'] && dump_opt['C'] > 1) {
		(void) printf("\nCached configuration:\n");
		dump_nvlist(spa->spa_config, 8);
	}

	if (dump_opt['C'])
		dump_config(spa);

	if (dump_opt['u'])
		dump_uberblock(&spa->spa_uberblock, "\nUberblock:\n", "\n");

	if (dump_opt['D'])
		dump_all_ddts(spa);

	if (dump_opt['d'] > 2 || dump_opt['m'])
		dump_metaslabs(spa);
	if (dump_opt['M'])
		dump_metaslab_groups(spa);

	if (dump_opt['d'] || dump_opt['i']) {
		spa_feature_t f;

		dump_dir(dp->dp_meta_objset);
		if (dump_opt['d'] >= 3) {
|
|
|
dsl_pool_t *dp = spa->spa_dsl_pool;
|
2015-04-27 01:27:36 +03:00
|
|
|
dump_full_bpobj(&spa->spa_deferred_bpobj,
|
2013-07-05 23:37:16 +04:00
|
|
|
"Deferred frees", 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
|
|
|
|
dump_full_bpobj(&dp->dp_free_bpobj,
|
2013-07-05 23:37:16 +04:00
|
|
|
"Pool snapshot frees", 0);
|
2012-12-14 03:24:15 +04:00
|
|
|
}
|
|
|
|
if (bpobj_is_open(&dp->dp_obsolete_bpobj)) {
|
|
|
|
ASSERT(spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_DEVICE_REMOVAL));
|
|
|
|
dump_full_bpobj(&dp->dp_obsolete_bpobj,
|
|
|
|
"Pool obsolete blocks", 0);
|
|
|
|
}
|
2012-12-14 03:24:15 +04:00
|
|
|
|
|
|
|
if (spa_feature_is_active(spa,
|
2013-10-08 21:13:05 +04:00
|
|
|
SPA_FEATURE_ASYNC_DESTROY)) {
|
2012-12-14 03:24:15 +04:00
|
|
|
dump_bptree(spa->spa_meta_objset,
|
|
|
|
dp->dp_bptree_obj,
|
2012-12-14 03:24:15 +04:00
|
|
|
"Pool dataset frees");
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_dtl(spa->spa_root_vdev, 0);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) dmu_objset_find(spa_name(spa), dump_one_dir,
|
|
|
|
NULL, DS_FIND_SNAPSHOTS | DS_FIND_CHILDREN);
|
2014-11-03 23:15:08 +03:00
|
|
|
|
2015-07-24 19:53:55 +03:00
|
|
|
for (f = 0; f < SPA_FEATURES; f++) {
|
|
|
|
uint64_t refcount;
|
|
|
|
|
|
|
|
if (!(spa_feature_table[f].fi_flags &
|
2017-01-24 19:59:08 +03:00
|
|
|
ZFEATURE_FLAG_PER_DATASET) ||
|
|
|
|
!spa_feature_is_enabled(spa, f)) {
|
2015-07-24 19:53:55 +03:00
|
|
|
ASSERT0(dataset_feature_count[f]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (feature_get_refcount(spa, &spa_feature_table[f],
|
|
|
|
&refcount) == ENOTSUP)
|
|
|
|
continue;
|
|
|
|
if (dataset_feature_count[f] != refcount) {
|
|
|
|
(void) printf("%s feature refcount mismatch: "
|
|
|
|
"%lld datasets != %lld refcount\n",
|
|
|
|
spa_feature_table[f].fi_uname,
|
|
|
|
(longlong_t)dataset_feature_count[f],
|
2015-06-25 07:05:32 +03:00
|
|
|
(longlong_t)refcount);
|
|
|
|
rc = 2;
|
|
|
|
} else {
|
2015-07-24 19:53:55 +03:00
|
|
|
(void) printf("Verified %s feature refcount "
|
|
|
|
"of %llu is correct\n",
|
|
|
|
spa_feature_table[f].fi_uname,
|
2015-06-25 07:05:32 +03:00
|
|
|
(u_longlong_t)refcount);
|
|
|
|
}
|
2014-11-03 23:15:08 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (rc == 0) {
|
|
|
|
rc = verify_device_removal_feature_counts(spa);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-11-03 23:15:08 +03:00
|
|
|
if (rc == 0 && (dump_opt['b'] || dump_opt['c']))
|
2008-11-20 23:01:55 +03:00
|
|
|
rc = dump_block_stats(spa);
|
|
|
|
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-02 01:25:53 +04:00
|
|
|
if (rc == 0)
|
|
|
|
rc = verify_spacemap_refcounts(spa);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (dump_opt['s'])
|
|
|
|
show_pool_stats(spa);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['h'])
|
|
|
|
dump_history(spa);
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
if (rc != 0) {
|
|
|
|
dump_debug_buffer();
|
2008-11-20 23:01:55 +03:00
|
|
|
exit(rc);
|
2017-01-28 23:16:43 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#define ZDB_FLAG_CHECKSUM 0x0001
|
|
|
|
#define ZDB_FLAG_DECOMPRESS 0x0002
|
|
|
|
#define ZDB_FLAG_BSWAP 0x0004
|
|
|
|
#define ZDB_FLAG_GBH 0x0008
|
|
|
|
#define ZDB_FLAG_INDIRECT 0x0010
|
|
|
|
#define ZDB_FLAG_PHYS 0x0020
|
|
|
|
#define ZDB_FLAG_RAW 0x0040
|
|
|
|
#define ZDB_FLAG_PRINT_BLKPTR 0x0080
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
static int flagbits[256];
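The `ZDB_FLAG_*` values are single-bit masks, and `flagbits` is a table indexed by character, so a string of option letters can be folded into one flags word. The sketch below shows one way that could work. The letter-to-flag assignments and the `sketch_*` function names are assumptions for illustration; the table's real initialization is not shown in this section.

```c
#include <stddef.h>

#define	ZDB_FLAG_CHECKSUM	0x0001
#define	ZDB_FLAG_DECOMPRESS	0x0002
#define	ZDB_FLAG_BSWAP		0x0004
#define	ZDB_FLAG_GBH		0x0008
#define	ZDB_FLAG_INDIRECT	0x0010
#define	ZDB_FLAG_PHYS		0x0020
#define	ZDB_FLAG_RAW		0x0040
#define	ZDB_FLAG_PRINT_BLKPTR	0x0080

static int flagbits[256];

/* Assumed letter assignments, for illustration only. */
static void
sketch_init_flagbits(void)
{
	flagbits['b'] = ZDB_FLAG_PRINT_BLKPTR;
	flagbits['c'] = ZDB_FLAG_CHECKSUM;
	flagbits['d'] = ZDB_FLAG_DECOMPRESS;
	flagbits['e'] = ZDB_FLAG_BSWAP;
	flagbits['g'] = ZDB_FLAG_GBH;
	flagbits['i'] = ZDB_FLAG_INDIRECT;
	flagbits['p'] = ZDB_FLAG_PHYS;
	flagbits['r'] = ZDB_FLAG_RAW;
}

/*
 * OR together the bits for every letter in s.
 * Returns -1 if any letter has no entry in the table.
 */
static int
sketch_parse_flags(const char *s)
{
	int flags = 0;

	for (; *s != '\0'; s++) {
		int bit = flagbits[(unsigned char)*s];

		if (bit == 0)
			return (-1);
		flags |= bit;
	}
	return (flags);
}
```

Indexing with `(unsigned char)*s` keeps the lookup inside the 256-entry table even when `char` is signed.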
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_print_blkptr(blkptr_t *bp, int flags)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
char blkbuf[BP_SPRINTF_LEN];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (flags & ZDB_FLAG_BSWAP)
|
|
|
|
byteswap_uint64_array((void *)bp, sizeof (blkptr_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-12-09 22:37:51 +04:00
|
|
|
snprintf_blkptr(blkbuf, sizeof (blkbuf), bp);
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("%s\n", blkbuf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_indirect(blkptr_t *bp, int nbps, int flags)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < nbps; i++)
|
|
|
|
zdb_print_blkptr(&bp[i], flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_gbh(void *buf, int flags)
|
|
|
|
{
|
|
|
|
zdb_dump_indirect((blkptr_t *)buf, SPA_GBH_NBLKPTRS, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
zdb_dump_block_raw(void *buf, uint64_t size, int flags)
|
|
|
|
{
|
|
|
|
if (flags & ZDB_FLAG_BSWAP)
|
|
|
|
byteswap_uint64_array(buf, size);
|
2010-08-26 20:52:40 +04:00
|
|
|
VERIFY(write(fileno(stdout), buf, size) == size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
zdb_dump_block(char *label, void *buf, uint64_t size, int flags)
{
	uint64_t *d = (uint64_t *)buf;
	unsigned nwords = size / sizeof (uint64_t);
	int do_bswap = !!(flags & ZDB_FLAG_BSWAP);
	unsigned i, j;
	const char *hdr;
	char *c;

	if (do_bswap)
		hdr = " 7 6 5 4 3 2 1 0 f e d c b a 9 8";
	else
		hdr = " 0 1 2 3 4 5 6 7 8 9 a b c d e f";

	(void) printf("\n%s\n%6s %s 0123456789abcdef\n", label, "", hdr);

#ifdef _LITTLE_ENDIAN
	/* correct the endianness */
	do_bswap = !do_bswap;
#endif
	for (i = 0; i < nwords; i += 2) {
		(void) printf("%06llx: %016llx %016llx ",
		    (u_longlong_t)(i * sizeof (uint64_t)),
		    (u_longlong_t)(do_bswap ? BSWAP_64(d[i]) : d[i]),
		    (u_longlong_t)(do_bswap ? BSWAP_64(d[i + 1]) : d[i + 1]));

		c = (char *)&d[i];
		/* Cast to uchar_t: isprint() is undefined for negative char. */
		for (j = 0; j < 2 * sizeof (uint64_t); j++)
			(void) printf("%c",
			    isprint((uchar_t)c[j]) ? c[j] : '.');
		(void) printf("\n");
	}
}
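
/*
 * Illustrative sketch, not part of zdb: how one 16-byte row of the dump
 * above can be assembled (byte offset, two 64-bit words, then an ASCII
 * column in which non-printable bytes become '.'). The names
 * sketch_bswap64() and sketch_format_row() are hypothetical, and the
 * row is written into a caller-supplied buffer rather than printed, so
 * that it can be checked in isolation.
 */

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: reverse the byte order of one 64-bit word. */
static uint64_t
sketch_bswap64(uint64_t x)
{
	uint64_t r = 0;
	int b;

	for (b = 0; b < 8; b++) {
		r = (r << 8) | (x & 0xff);
		x >>= 8;
	}
	return (r);
}

/*
 * Hypothetical helper: format the row that starts at word index i.
 * Returns the snprintf() result, i.e. the length of the full row.
 */
static int
sketch_format_row(const uint64_t *d, size_t i, int do_bswap,
    char *out, size_t outlen)
{
	const unsigned char *c = (const unsigned char *)&d[i];
	uint64_t w0 = do_bswap ? sketch_bswap64(d[i]) : d[i];
	uint64_t w1 = do_bswap ? sketch_bswap64(d[i + 1]) : d[i + 1];
	char ascii[17];
	size_t j;

	assert(d != NULL && out != NULL);

	/* The ASCII column always follows memory order, swapped or not. */
	for (j = 0; j < 16; j++)
		ascii[j] = isprint(c[j]) ? (char)c[j] : '.';
	ascii[16] = '\0';

	return (snprintf(out, outlen, "%06llx: %016llx %016llx %s",
	    (unsigned long long)(i * sizeof (uint64_t)),
	    (unsigned long long)w0, (unsigned long long)w1, ascii));
}
```

/*
 * Reading the bytes through an unsigned char pointer keeps the
 * isprint() argument in its defined range.
 */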
|
|
|
|
|
|
|
|
/*
 * There are two acceptable formats:
 *	leaf_name	  - For example: c1t0d0 or /tmp/ztest.0a
 *	child[.child]*    - For example: 0.1.1
 *
 * The second form can be used to specify arbitrary vdevs anywhere
 * in the hierarchy.  For example, in a pool with a mirror of
 * RAID-Zs, you can specify either RAID-Z vdev with 0.0 or 0.1.
 */
static vdev_t *
zdb_vdev_lookup(vdev_t *vdev, const char *path)
{
	char *s, *p, *q;
	unsigned i;

	if (vdev == NULL)
		return (NULL);

	/* First, assume the x.x.x.x format */
	i = strtoul(path, &s, 10);
	if (s == path || (s && *s != '.' && *s != '\0'))
		goto name;
	if (i >= vdev->vdev_children)
		return (NULL);

	vdev = vdev->vdev_child[i];
	if (s && *s == '\0')
		return (vdev);
	return (zdb_vdev_lookup(vdev, s+1));

name:
	for (i = 0; i < vdev->vdev_children; i++) {
		vdev_t *vc = vdev->vdev_child[i];

		if (vc->vdev_path == NULL) {
			vc = zdb_vdev_lookup(vc, path);
			if (vc == NULL)
				continue;
			else
				return (vc);
		}

		p = strrchr(vc->vdev_path, '/');
		p = p ? p + 1 : vc->vdev_path;
		q = &vc->vdev_path[strlen(vc->vdev_path) - 2];

		if (strcmp(vc->vdev_path, path) == 0)
			return (vc);
		if (strcmp(p, path) == 0)
			return (vc);
		if (strcmp(q, "s0") == 0 && strncmp(p, path, q - p) == 0)
			return (vc);
	}

	return (NULL);
}
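
/*
 * Illustrative sketch, not part of zdb: the child[.child]* form walked
 * over a toy tree. sketch_node_t and sketch_lookup() are hypothetical
 * stand-ins for vdev_t and zdb_vdev_lookup(). Only the numeric form is
 * modelled; where the real function falls back to matching by leaf
 * name, this sketch simply returns NULL.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the parts of vdev_t the walk needs. */
typedef struct sketch_node {
	unsigned nchildren;
	struct sketch_node **child;
} sketch_node_t;

static sketch_node_t *
sketch_lookup(sketch_node_t *node, const char *path)
{
	char *end;
	unsigned long idx;

	assert(path != NULL);
	if (node == NULL)
		return (NULL);

	/* Consume one decimal component; it must end at '.' or NUL. */
	idx = strtoul(path, &end, 10);
	if (end == path || (*end != '.' && *end != '\0'))
		return (NULL);
	if (idx >= node->nchildren)
		return (NULL);

	node = node->child[idx];
	if (*end == '\0')
		return (node);

	/* Skip the '.' and descend with the rest of the path. */
	return (sketch_lookup(node, end + 1));
}
```

/*
 * With a root whose only child is a mirror of two RAID-Z nodes, "0.0"
 * and "0.1" select the two RAID-Z nodes, as in the comment above
 * zdb_vdev_lookup().
 */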
|
|
|
|
|
|
|
|
/*
 * Read a block from a pool and print it out.  The syntax of the
 * block descriptor is:
 *
 *	pool:vdev_specifier:offset:size[:flags]
 *
 *	pool           - The name of the pool you wish to read from
 *	vdev_specifier - Which vdev (see comment for zdb_vdev_lookup)
 *	offset         - offset, in hex, in bytes
 *	size           - Amount of data to read, in hex, in bytes
 *	flags          - A string of characters specifying options
 *		 b: Decode a blkptr at given offset within block
 *		*c: Calculate and display checksums
 *		 d: Decompress data before dumping
 *		 e: Byteswap data before dumping
 *		 g: Display data as a gang block header
 *		 i: Display as an indirect block
 *		 p: Do I/O to physical offset
 *		 r: Dump raw data to stdout
 *
 *              * = not yet implemented
 */
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_read_block(char *thing, spa_t *spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
blkptr_t blk, *bp = &blk;
|
|
|
|
dva_t *dva = bp->blk_dva;
|
2008-11-20 23:01:55 +03:00
|
|
|
int flags = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t offset = 0, size = 0, psize = 0, lsize = 0, blkptr_offset = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *zio;
|
|
|
|
vdev_t *vd;
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *pabd;
|
|
|
|
void *lbuf, *buf;
|
2017-10-27 22:46:35 +03:00
|
|
|
const char *s, *vdev;
|
|
|
|
char *p, *dup, *flagstr;
|
2010-05-29 00:45:14 +04:00
|
|
|
int i, error;
|
2017-01-05 22:10:07 +03:00
|
|
|
boolean_t borrowed = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dup = strdup(thing);
|
|
|
|
s = strtok(dup, ":");
|
|
|
|
vdev = s ? s : "";
|
|
|
|
s = strtok(NULL, ":");
|
|
|
|
offset = strtoull(s ? s : "", NULL, 16);
|
|
|
|
s = strtok(NULL, ":");
|
|
|
|
size = strtoull(s ? s : "", NULL, 16);
|
|
|
|
s = strtok(NULL, ":");
|
2017-10-27 22:46:35 +03:00
|
|
|
if (s)
|
|
|
|
flagstr = strdup(s);
|
|
|
|
else
|
|
|
|
flagstr = strdup("");
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
s = NULL;
|
|
|
|
if (size == 0)
|
|
|
|
s = "size must not be zero";
|
|
|
|
if (!IS_P2ALIGNED(size, DEV_BSIZE))
|
|
|
|
s = "size must be a multiple of sector size";
|
|
|
|
if (!IS_P2ALIGNED(offset, DEV_BSIZE))
|
|
|
|
s = "offset must be a multiple of sector size";
|
|
|
|
if (s) {
|
|
|
|
(void) printf("Invalid block specifier: %s - %s\n", thing, s);
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (s = strtok(flagstr, ":"); s; s = strtok(NULL, ":")) {
|
|
|
|
for (i = 0; flagstr[i]; i++) {
|
|
|
|
int bit = flagbits[(uchar_t)flagstr[i]];
|
|
|
|
|
|
|
|
if (bit == 0) {
|
|
|
|
(void) printf("***Invalid flag: %c\n",
|
|
|
|
flagstr[i]);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
flags |= bit;
|
|
|
|
|
|
|
|
/* If it's not something with an argument, keep going */
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((bit & (ZDB_FLAG_CHECKSUM |
|
2008-11-20 23:01:55 +03:00
|
|
|
ZDB_FLAG_PRINT_BLKPTR)) == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
p = &flagstr[i + 1];
|
2016-02-03 19:07:34 +03:00
|
|
|
if (bit == ZDB_FLAG_PRINT_BLKPTR) {
|
2008-11-20 23:01:55 +03:00
|
|
|
blkptr_offset = strtoull(p, &p, 16);
|
2016-02-03 19:07:34 +03:00
|
|
|
i = p - &flagstr[i + 1];
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
if (*p != ':' && *p != '\0') {
|
|
|
|
(void) printf("***Invalid flag arg: '%s'\n", s);
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2017-10-27 22:46:35 +03:00
|
|
|
free(flagstr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
vd = zdb_vdev_lookup(spa->spa_root_vdev, vdev);
|
|
|
|
if (vd == NULL) {
|
|
|
|
(void) printf("***Invalid vdev: %s\n", vdev);
|
|
|
|
free(dup);
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
if (vd->vdev_path)
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, "Found vdev: %s\n",
|
|
|
|
vd->vdev_path);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) fprintf(stderr, "Found vdev type: %s\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
vd->vdev_ops->vdev_op_type);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
psize = size;
|
|
|
|
lsize = size;
|
|
|
|
|
2017-01-05 22:10:07 +03:00
|
|
|
pabd = abd_alloc_for_io(SPA_MAXBLOCKSIZE, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
lbuf = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
|
|
|
|
|
|
|
|
BP_ZERO(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
DVA_SET_VDEV(&dva[0], vd->vdev_id);
|
|
|
|
DVA_SET_OFFSET(&dva[0], offset);
|
|
|
|
DVA_SET_GANG(&dva[0], !!(flags & ZDB_FLAG_GBH));
|
|
|
|
DVA_SET_ASIZE(&dva[0], vdev_psize_to_asize(vd, psize));
|
|
|
|
|
|
|
|
BP_SET_BIRTH(bp, TXG_INITIAL, TXG_INITIAL);
|
|
|
|
|
|
|
|
BP_SET_LSIZE(bp, lsize);
|
|
|
|
BP_SET_PSIZE(bp, psize);
|
|
|
|
BP_SET_COMPRESS(bp, ZIO_COMPRESS_OFF);
|
|
|
|
BP_SET_CHECKSUM(bp, ZIO_CHECKSUM_OFF);
|
|
|
|
BP_SET_TYPE(bp, DMU_OT_NONE);
|
|
|
|
BP_SET_LEVEL(bp, 0);
|
|
|
|
BP_SET_DEDUP(bp, 0);
|
|
|
|
BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
|
2008-11-20 23:01:55 +03:00
|
|
|
zio = zio_root(spa, NULL, NULL, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (vd == vd->vdev_top) {
|
|
|
|
/*
|
|
|
|
* Treat this as a normal block read.
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_read(zio, spa, bp, pabd, psize, NULL, NULL,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZIO_PRIORITY_SYNC_READ,
|
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_RAW, NULL));
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Treat this as a vdev child I/O.
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
zio_nowait(zio_vdev_child_io(zio, bp, vd, offset, pabd,
|
|
|
|
psize, ZIO_TYPE_READ, ZIO_PRIORITY_SYNC_READ,
|
2010-05-29 00:45:14 +04:00
|
|
|
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_QUEUE |
|
|
|
|
ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
|
2016-09-22 19:30:13 +03:00
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_RAW | ZIO_FLAG_OPTIONAL,
|
|
|
|
NULL, NULL));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
error = zio_wait(zio);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_STATE, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error) {
|
|
|
|
(void) printf("Read of %s failed, error: %d\n", thing, error);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (flags & ZDB_FLAG_DECOMPRESS) {
|
|
|
|
/*
|
|
|
|
* We don't know how the data was compressed, so just try
|
|
|
|
* every decompress function at every inflated blocksize.
|
|
|
|
*/
|
|
|
|
enum zio_compress c;
|
|
|
|
void *lbuf2 = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);
|
|
|
|
|
2016-06-29 23:59:51 +03:00
|
|
|
/*
|
|
|
|
* XXX - On the one hand, with SPA_MAXBLOCKSIZE at 16MB,
|
|
|
|
* this could take a while and we should let the user know
|
|
|
|
* we are not stuck. On the other hand, printing progress
|
|
|
|
* info gets old after a while. What to do?
|
|
|
|
*/
|
|
|
|
for (lsize = psize + SPA_MINBLOCKSIZE;
|
|
|
|
lsize <= SPA_MAXBLOCKSIZE; lsize += SPA_MINBLOCKSIZE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
for (c = 0; c < ZIO_COMPRESS_FUNCTIONS; c++) {
|
2018-02-02 03:19:36 +03:00
|
|
|
/*
|
|
|
|
* ZLE can easily decompress a non-ZLE stream.
|
|
|
|
* So have an option to disable it.
|
|
|
|
*/
|
|
|
|
if (c == ZIO_COMPRESS_ZLE &&
|
|
|
|
getenv("ZDB_NO_ZLE"))
|
|
|
|
continue;
|
|
|
|
|
2016-08-11 02:28:58 +03:00
|
|
|
(void) fprintf(stderr,
|
|
|
|
"Trying %05llx -> %05llx (%s)\n",
|
2016-06-29 23:59:51 +03:00
|
|
|
(u_longlong_t)psize, (u_longlong_t)lsize,
|
|
|
|
zio_compress_table[c].ci_name);
|
2018-02-02 03:19:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We randomize lbuf2, and decompress to both
|
|
|
|
* lbuf and lbuf2. This way, we will know if
|
|
|
|
* decompression fills exactly lsize bytes.
|
|
|
|
*/
|
|
|
|
VERIFY0(random_get_pseudo_bytes(lbuf2, lsize));
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
if (zio_decompress_data(c, pabd,
|
|
|
|
lbuf, psize, lsize) == 0 &&
|
2018-02-02 03:19:36 +03:00
|
|
|
zio_decompress_data(c, pabd,
|
2016-07-22 18:52:49 +03:00
|
|
|
lbuf2, psize, lsize) == 0 &&
|
2010-05-29 00:45:14 +04:00
|
|
|
bcmp(lbuf, lbuf2, lsize) == 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (c != ZIO_COMPRESS_FUNCTIONS)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
umem_free(lbuf2, SPA_MAXBLOCKSIZE);
|
|
|
|
|
2018-02-02 03:19:36 +03:00
|
|
|
if (lsize > SPA_MAXBLOCKSIZE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
(void) printf("Decompress of %s failed\n", thing);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
buf = lbuf;
|
|
|
|
size = lsize;
|
|
|
|
} else {
|
|
|
|
size = psize;
|
2017-01-05 22:10:07 +03:00
|
|
|
buf = abd_borrow_buf_copy(pabd, size);
|
|
|
|
borrowed = B_TRUE;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (flags & ZDB_FLAG_PRINT_BLKPTR)
|
|
|
|
zdb_print_blkptr((blkptr_t *)(void *)
|
|
|
|
((uintptr_t)buf + (uintptr_t)blkptr_offset), flags);
|
|
|
|
else if (flags & ZDB_FLAG_RAW)
|
|
|
|
zdb_dump_block_raw(buf, size, flags);
|
|
|
|
else if (flags & ZDB_FLAG_INDIRECT)
|
|
|
|
zdb_dump_indirect((blkptr_t *)buf, size / sizeof (blkptr_t),
|
|
|
|
flags);
|
|
|
|
else if (flags & ZDB_FLAG_GBH)
|
|
|
|
zdb_dump_gbh(buf, flags);
|
|
|
|
else
|
|
|
|
zdb_dump_block(thing, buf, size, flags);
|
|
|
|
|
2017-01-05 22:10:07 +03:00
|
|
|
if (borrowed)
|
|
|
|
abd_return_buf_copy(pabd, buf, size);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
out:
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_free(pabd);
|
2010-05-29 00:45:14 +04:00
|
|
|
umem_free(lbuf, SPA_MAXBLOCKSIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
free(dup);
|
|
|
|
}
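
/*
 * Illustrative sketch, not part of zdb: the front half of
 * zdb_read_block() in isolation, i.e. splitting the colon-separated
 * string into a vdev specifier, a hex offset and a hex size, then
 * applying the same three sanity checks. sketch_parse_spec() and
 * SKETCH_SECTOR are hypothetical; the 512-byte sector size is an
 * assumption standing in for DEV_BSIZE. As in the original, a later
 * failed check overwrites an earlier message.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	SKETCH_SECTOR	512ULL	/* assumed sector size for this sketch */

/* Returns NULL on success, or a message describing the last failure. */
static const char *
sketch_parse_spec(const char *spec, char *vdev, size_t vdevlen,
    uint64_t *offset, uint64_t *size)
{
	char buf[256];
	const char *s, *err = NULL;

	assert(spec != NULL && vdev != NULL);
	assert(offset != NULL && size != NULL);

	/* strtok() modifies its input, so work on a private copy. */
	(void) snprintf(buf, sizeof (buf), "%s", spec);

	s = strtok(buf, ":");
	(void) snprintf(vdev, vdevlen, "%s", s ? s : "");
	s = strtok(NULL, ":");
	*offset = strtoull(s ? s : "", NULL, 16);
	s = strtok(NULL, ":");
	*size = strtoull(s ? s : "", NULL, 16);

	if (*size == 0)
		err = "size must not be zero";
	if (*size % SKETCH_SECTOR != 0)
		err = "size must be a multiple of sector size";
	if (*offset % SKETCH_SECTOR != 0)
		err = "offset must be a multiple of sector size";
	return (err);
}
```

/*
 * A missing field parses as 0 because strtoull() is handed an empty
 * string, so an absent size is caught by the "must not be zero" check.
 */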
|
|
|
|
|
2017-05-01 21:06:07 +03:00
|
|
|
static void
zdb_embedded_block(char *thing)
{
	blkptr_t bp;
	unsigned long long *words = (void *)&bp;
	char *buf;
	int err;

	buf = umem_alloc(SPA_MAXBLOCKSIZE, UMEM_NOFAIL);

	bzero(&bp, sizeof (bp));
	err = sscanf(thing, "%llx:%llx:%llx:%llx:%llx:%llx:%llx:%llx:"
	    "%llx:%llx:%llx:%llx:%llx:%llx:%llx:%llx",
	    words + 0, words + 1, words + 2, words + 3,
	    words + 4, words + 5, words + 6, words + 7,
	    words + 8, words + 9, words + 10, words + 11,
	    words + 12, words + 13, words + 14, words + 15);
	if (err != 16) {
		(void) printf("invalid input format\n");
		exit(1);
	}
	ASSERT3U(BPE_GET_LSIZE(&bp), <=, SPA_MAXBLOCKSIZE);
	err = decode_embedded_bp(&bp, buf, BPE_GET_LSIZE(&bp));
	if (err != 0) {
		(void) printf("decode failed: %u\n", err);
		exit(1);
	}
	zdb_dump_block_raw(buf, BPE_GET_LSIZE(&bp), 0);
	umem_free(buf, SPA_MAXBLOCKSIZE);
}
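
/*
 * Illustrative sketch, not part of zdb: the sscanf() pattern used by
 * zdb_embedded_block(), cut down to four colon-separated hex words so
 * it stays short. sketch_parse_words() is a hypothetical name. The
 * point it shows is that the conversion count returned by sscanf()
 * must equal the number of fields requested, otherwise the input is
 * rejected.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 0 if exactly four hex words were parsed, -1 otherwise. */
static int
sketch_parse_words(const char *s, unsigned long long w[4])
{
	int n;

	assert(s != NULL && w != NULL);
	n = sscanf(s, "%llx:%llx:%llx:%llx", &w[0], &w[1], &w[2], &w[3]);
	return (n == 4 ? 0 : -1);
}
```

/*
 * sscanf() stops at the first field it cannot convert, so a short or
 * malformed string yields a count below four.
 */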
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
|
|
|
main(int argc, char **argv)
|
|
|
|
{
|
2017-10-27 22:46:35 +03:00
|
|
|
int c;
|
2008-11-20 23:01:55 +03:00
|
|
|
struct rlimit rl = { 1024, 1024 };
|
2010-05-29 00:45:14 +04:00
|
|
|
spa_t *spa = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
objset_t *os = NULL;
|
|
|
|
int dump_all = 1;
|
|
|
|
int verbose = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
int error = 0;
|
|
|
|
char **searchdirs = NULL;
|
|
|
|
int nsearch = 0;
|
2018-02-02 03:36:40 +03:00
|
|
|
char *target, *target_pool;
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *policy = NULL;
|
|
|
|
uint64_t max_txg = UINT64_MAX;
|
2014-06-08 22:10:14 +04:00
|
|
|
int flags = ZFS_IMPORT_MISSING_LOG;
|
2010-05-29 00:45:14 +04:00
|
|
|
int rewind = ZPOOL_NEVER_REWIND;
|
2013-06-24 10:45:20 +04:00
|
|
|
char *spa_config_path_env;
|
2015-05-14 20:45:56 +03:00
|
|
|
boolean_t target_is_spa = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) setrlimit(RLIMIT_NOFILE, &rl);
|
|
|
|
(void) enable_extended_FILE_stdio(-1, -1);
|
|
|
|
|
|
|
|
dprintf_setup(&argc, argv);
|
|
|
|
|
2013-06-24 10:45:20 +04:00
|
|
|
/*
|
|
|
|
* If the SPA_CONFIG_PATH environment variable is set, it overrides
* the default spa_config_path setting. If the -U flag is specified,
* it in turn overrides the environment variable.
|
|
|
|
*/
|
|
|
|
spa_config_path_env = getenv("SPA_CONFIG_PATH");
|
|
|
|
if (spa_config_path_env != NULL)
|
|
|
|
spa_config_path = spa_config_path_env;
|
|
|
|
|
2016-01-01 16:42:58 +03:00
|
|
|
while ((c = getopt(argc, argv,
|
2017-05-01 21:06:07 +03:00
|
|
|
"AbcCdDeEFGhiI:lLmMo:Op:PqRsSt:uU:vVx:X")) != -1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
switch (c) {
|
|
|
|
case 'b':
|
|
|
|
case 'c':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'C':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'd':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'D':
|
2017-05-01 21:06:07 +03:00
|
|
|
case 'E':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'G':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'h':
|
|
|
|
case 'i':
|
|
|
|
case 'l':
|
2009-07-03 02:44:48 +04:00
|
|
|
case 'm':
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'M':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'O':
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'R':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 's':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'S':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'u':
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_opt[c]++;
|
|
|
|
dump_all = 0;
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'A':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'e':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'F':
|
2009-01-16 00:59:39 +03:00
|
|
|
case 'L':
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'P':
|
2017-02-04 01:18:28 +03:00
|
|
|
case 'q':
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'X':
|
2009-01-16 00:59:39 +03:00
|
|
|
dump_opt[c]++;
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
/* NB: Sort single match options below. */
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'I':
|
2013-05-03 03:36:32 +04:00
|
|
|
max_inflight = strtoull(optarg, NULL, 0);
|
|
|
|
if (max_inflight == 0) {
|
|
|
|
(void) fprintf(stderr, "maximum number "
|
|
|
|
"of inflight I/Os must be greater "
|
|
|
|
"than 0\n");
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'o':
|
|
|
|
error = set_global_var(optarg);
|
|
|
|
if (error != 0)
|
|
|
|
usage();
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
case 'p':
|
2010-05-29 00:45:14 +04:00
|
|
|
if (searchdirs == NULL) {
|
|
|
|
searchdirs = umem_alloc(sizeof (char *),
|
|
|
|
UMEM_NOFAIL);
|
|
|
|
} else {
|
|
|
|
char **tmp = umem_alloc((nsearch + 1) *
|
|
|
|
sizeof (char *), UMEM_NOFAIL);
|
|
|
|
bcopy(searchdirs, tmp, nsearch *
|
|
|
|
sizeof (char *));
|
|
|
|
umem_free(searchdirs,
|
|
|
|
nsearch * sizeof (char *));
|
|
|
|
searchdirs = tmp;
|
|
|
|
}
|
|
|
|
searchdirs[nsearch++] = optarg;
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2009-01-16 00:59:39 +03:00
|
|
|
case 't':
|
2010-05-29 00:45:14 +04:00
|
|
|
max_txg = strtoull(optarg, NULL, 0);
|
|
|
|
if (max_txg < TXG_INITIAL) {
|
2009-01-16 00:59:39 +03:00
|
|
|
(void) fprintf(stderr, "incorrect txg "
|
|
|
|
"specified: %s\n", optarg);
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
break;
|
2010-05-29 00:45:14 +04:00
|
|
|
case 'U':
|
|
|
|
spa_config_path = optarg;
|
2017-05-01 21:06:07 +03:00
|
|
|
if (spa_config_path[0] != '/') {
|
|
|
|
(void) fprintf(stderr,
|
|
|
|
"cachefile must be an absolute path "
|
|
|
|
"(i.e. start with a slash)\n");
|
|
|
|
usage();
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
break;
|
2014-07-20 00:19:24 +04:00
|
|
|
case 'v':
|
|
|
|
verbose++;
|
|
|
|
break;
|
2017-04-13 19:40:56 +03:00
|
|
|
case 'V':
|
2017-04-14 00:28:46 +03:00
|
|
|
flags = ZFS_IMPORT_VERBATIM;
|
2017-04-13 19:40:56 +03:00
|
|
|
break;
|
2017-01-31 21:13:10 +03:00
|
|
|
case 'x':
|
|
|
|
vn_dumpdir = optarg;
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
default:
|
|
|
|
usage();
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['e'] && searchdirs != NULL) {
|
2008-12-03 23:09:06 +03:00
|
|
|
(void) fprintf(stderr, "-p option requires use of -e\n");
|
|
|
|
usage();
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-10-24 02:26:49 +04:00
|
|
|
#if defined(_LP64)
|
2014-09-17 00:24:48 +04:00
|
|
|
/*
|
|
|
|
* ZDB does not typically re-read blocks; therefore limit the ARC
|
|
|
|
* to 256 MB, which can be used entirely for metadata.
|
|
|
|
*/
|
|
|
|
zfs_arc_max = zfs_arc_meta_limit = 256 * 1024 * 1024;
|
2014-10-24 02:26:49 +04:00
|
|
|
#endif
|
2014-09-17 00:24:48 +04:00
|
|
|
|
2015-05-15 02:41:29 +03:00
|
|
|
/*
|
|
|
|
* "zdb -c" uses checksum-verifying scrub i/os which are async reads.
|
|
|
|
* "zdb -b" uses traversal prefetch which uses async reads.
|
|
|
|
* For good performance, let several of them be active at once.
|
|
|
|
*/
|
|
|
|
zfs_vdev_async_read_max_active = 10;
|
|
|
|
|
2017-02-01 01:36:35 +03:00
|
|
|
/*
|
|
|
|
* Disable reference tracking for better performance.
|
|
|
|
*/
|
|
|
|
reference_tracking_enable = B_FALSE;
|
|
|
|
|
2018-01-31 02:25:19 +03:00
|
|
|
/*
|
|
|
|
* Do not fail spa_load when spa_load_verify fails. This is needed
|
|
|
|
* to load non-idle pools.
|
|
|
|
*/
|
|
|
|
spa_load_verify_dryrun = B_TRUE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
kernel_init(FREAD);
|
2015-05-21 00:39:52 +03:00
|
|
|
if ((g_zfs = libzfs_init()) == NULL) {
|
|
|
|
(void) fprintf(stderr, "%s", libzfs_error_init(errno));
|
2010-08-26 22:57:29 +04:00
|
|
|
return (1);
|
2015-05-21 00:39:52 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_all)
|
|
|
|
verbose = MAX(verbose, 1);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (c = 0; c < 256; c++) {
|
2017-05-01 21:06:07 +03:00
|
|
|
if (dump_all && strchr("AeEFlLOPRSX", c) == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
dump_opt[c] = 1;
|
|
|
|
if (dump_opt[c])
|
|
|
|
dump_opt[c] += verbose;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
aok = (dump_opt['A'] == 1) || (dump_opt['A'] > 2);
|
|
|
|
zfs_recover = (dump_opt['A'] > 1);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
argc -= optind;
|
|
|
|
argv += optind;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (argc < 2 && dump_opt['R'])
|
|
|
|
usage();
|
2017-05-01 21:06:07 +03:00
|
|
|
|
|
|
|
if (dump_opt['E']) {
|
|
|
|
if (argc != 1)
|
|
|
|
usage();
|
|
|
|
zdb_embedded_block(argv[0]);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (argc < 1) {
|
2010-05-29 00:45:14 +04:00
|
|
|
if (!dump_opt['e'] && dump_opt['C']) {
|
2008-12-03 23:09:06 +03:00
|
|
|
dump_cachefile(spa_config_path);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
usage();
|
|
|
|
}
|
|
|
|
|
2017-02-04 01:18:28 +03:00
|
|
|
if (dump_opt['l'])
|
|
|
|
return (dump_label(argv[0]));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
if (dump_opt['O']) {
|
|
|
|
if (argc != 2)
|
|
|
|
usage();
|
|
|
|
dump_opt['v'] = verbose + 3;
|
|
|
|
return (dump_path(argv[0], argv[1]));
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['X'] || dump_opt['F'])
|
|
|
|
rewind = ZPOOL_DO_REWIND |
|
|
|
|
(dump_opt['X'] ? ZPOOL_EXTREME_REWIND : 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_alloc(&policy, NV_UNIQUE_NAME_TYPE, 0) != 0 ||
|
|
|
|
nvlist_add_uint64(policy, ZPOOL_REWIND_REQUEST_TXG, max_txg) != 0 ||
|
|
|
|
nvlist_add_uint32(policy, ZPOOL_REWIND_REQUEST, rewind) != 0)
|
|
|
|
fatal("internal error: %s", strerror(ENOMEM));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
error = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
target = argv[0];
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-02-02 03:36:40 +03:00
|
|
|
if (strpbrk(target, "/@") != NULL) {
|
|
|
|
size_t targetlen;
|
|
|
|
|
|
|
|
target_pool = strdup(target);
|
|
|
|
*strpbrk(target_pool, "/@") = '\0';
|
|
|
|
|
|
|
|
target_is_spa = B_FALSE;
|
|
|
|
targetlen = strlen(target);
|
|
|
|
if (targetlen && target[targetlen - 1] == '/')
|
|
|
|
target[targetlen - 1] = '\0';
|
|
|
|
} else {
|
|
|
|
target_pool = target;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (dump_opt['e']) {
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-08 06:20:35 +03:00
|
|
|
importargs_t args = { 0 };
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_t *cfg = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
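Each data row of the multihost kstat output quoted in the message above carries six whitespace-separated fields, named by the header line (txg, timestamp, mmp_delay, vdev_guid, vdev_label, vdev_path). As a rough sketch only, such a row could be parsed in C as shown here. The mmp_row_t type and mmp_parse_row() function are hypothetical names introduced for illustration and are not part of ZFS; the 64-byte path buffer is an arbitrary assumption.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one data row of the multihost kstat. */
typedef struct mmp_row {
	uint64_t txg;
	uint64_t timestamp;
	uint64_t mmp_delay;
	uint64_t vdev_guid;
	uint64_t vdev_label;
	char vdev_path[64];	/* assumed maximum, longer paths are cut */
} mmp_row_t;

/*
 * Parse one data row. Returns 0 when all six fields were converted,
 * -1 otherwise (for example on the kstat header lines, which do not
 * consist of five decimal numbers followed by a path).
 */
static int
mmp_parse_row(const char *line, mmp_row_t *row)
{
	int n = sscanf(line,
	    "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
	    " %63s",
	    &row->txg, &row->timestamp, &row->mmp_delay,
	    &row->vdev_guid, &row->vdev_label, row->vdev_path);

	return (n == 6 ? 0 : -1);
}
```

A caller would read /proc/spl/kstat/zfs/<pool>/multihost line by line (for instance with fgets()) and skip every line for which mmp_parse_row() returns -1.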
2017-07-08 06:20:35 +03:00
|
|
|
args.paths = nsearch;
|
|
|
|
args.path = searchdirs;
|
|
|
|
args.can_be_active = B_TRUE;
|
|
|
|
|
2018-02-02 03:36:40 +03:00
|
|
|
error = zpool_tryimport(g_zfs, target_pool, &cfg, &args);
|
|
|
|
|
2017-07-08 06:20:35 +03:00
|
|
|
if (error == 0) {
|
2018-02-02 03:36:40 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (nvlist_add_nvlist(cfg,
|
|
|
|
ZPOOL_REWIND_POLICY, policy) != 0) {
|
|
|
|
fatal("can't open '%s': %s",
|
|
|
|
target, strerror(ENOMEM));
|
|
|
|
}
|
2017-07-08 06:20:35 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable the activity check to allow examination of
|
|
|
|
* active pools.
|
|
|
|
*/
|
|
|
|
if (dump_opt['C'] > 1) {
|
|
|
|
(void) printf("\nConfiguration for import:\n");
|
|
|
|
dump_nvlist(cfg, 8);
|
|
|
|
}
|
2018-02-02 03:36:40 +03:00
|
|
|
error = spa_import(target_pool, cfg, NULL,
|
2017-07-08 06:20:35 +03:00
|
|
|
flags | ZFS_IMPORT_SKIP_MMP);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2018-02-02 03:36:40 +03:00
|
|
|
if (target_pool != target)
|
|
|
|
free(target_pool);
|
2015-05-14 20:45:56 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (error == 0) {
|
2015-05-14 20:45:56 +03:00
|
|
|
if (target_is_spa || dump_opt['R']) {
|
2017-07-08 06:20:35 +03:00
|
|
|
/*
|
|
|
|
* Disable the activity check to allow examination of
|
|
|
|
* active pools.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(target)) != NULL) {
|
|
|
|
spa->spa_import_flags |= ZFS_IMPORT_SKIP_MMP;
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
error = spa_open_rewind(target, &spa, FTAG, policy,
|
|
|
|
NULL);
|
|
|
|
if (error) {
|
|
|
|
/*
|
|
|
|
* If we're missing the log device then
|
|
|
|
* try opening the pool after clearing the
|
|
|
|
* log state.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((spa = spa_lookup(target)) != NULL &&
|
|
|
|
spa->spa_log_state == SPA_LOG_MISSING) {
|
|
|
|
spa->spa_log_state = SPA_LOG_CLEAR;
|
|
|
|
error = 0;
|
|
|
|
}
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
if (!error) {
|
|
|
|
error = spa_open_rewind(target, &spa,
|
|
|
|
FTAG, policy, NULL);
|
|
|
|
}
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2017-04-13 19:40:56 +03:00
|
|
|
error = open_objset(target, DMU_OST_ANY, FTAG, &os);
|
2018-01-09 03:15:23 +03:00
|
|
|
if (error == 0)
|
|
|
|
spa = dmu_objset_spa(os);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
nvlist_free(policy);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (error)
|
2010-05-29 00:45:14 +04:00
|
|
|
fatal("can't open '%s': %s", target, strerror(error));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-01-09 03:15:23 +03:00
|
|
|
/*
|
|
|
|
* Set the pool failure mode to panic in order to prevent the pool
|
|
|
|
* from suspending. A suspended I/O will have no way to resume and
|
|
|
|
* can prevent the zdb(8) command from terminating as expected.
|
|
|
|
*/
|
|
|
|
if (spa != NULL)
|
|
|
|
spa->spa_failmode = ZIO_FAILURE_MODE_PANIC;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
argv++;
|
2010-05-29 00:45:14 +04:00
|
|
|
argc--;
|
|
|
|
if (!dump_opt['R']) {
|
|
|
|
if (argc > 0) {
|
|
|
|
zopt_objects = argc;
|
|
|
|
zopt_object = calloc(zopt_objects, sizeof (uint64_t));
|
2017-10-27 22:46:35 +03:00
|
|
|
for (unsigned i = 0; i < zopt_objects; i++) {
|
2010-05-29 00:45:14 +04:00
|
|
|
errno = 0;
|
|
|
|
zopt_object[i] = strtoull(argv[i], NULL, 0);
|
|
|
|
if (zopt_object[i] == 0 && errno != 0)
|
|
|
|
fatal("bad number %s: %s",
|
|
|
|
argv[i], strerror(errno));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2013-01-12 04:42:50 +04:00
|
|
|
if (os != NULL) {
|
|
|
|
dump_dir(os);
|
|
|
|
} else if (zopt_objects > 0 && !dump_opt['m']) {
|
|
|
|
dump_dir(spa->spa_meta_objset);
|
|
|
|
} else {
|
|
|
|
dump_zpool(spa);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
flagbits['b'] = ZDB_FLAG_PRINT_BLKPTR;
|
|
|
|
flagbits['c'] = ZDB_FLAG_CHECKSUM;
|
|
|
|
flagbits['d'] = ZDB_FLAG_DECOMPRESS;
|
|
|
|
flagbits['e'] = ZDB_FLAG_BSWAP;
|
|
|
|
flagbits['g'] = ZDB_FLAG_GBH;
|
|
|
|
flagbits['i'] = ZDB_FLAG_INDIRECT;
|
|
|
|
flagbits['p'] = ZDB_FLAG_PHYS;
|
|
|
|
flagbits['r'] = ZDB_FLAG_RAW;
|
|
|
|
|
2017-10-27 22:46:35 +03:00
|
|
|
for (int i = 0; i < argc; i++)
|
2010-05-29 00:45:14 +04:00
|
|
|
zdb_read_block(argv[i], spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-04-13 19:40:56 +03:00
|
|
|
if (os != NULL)
|
|
|
|
close_objset(os, FTAG);
|
|
|
|
else
|
|
|
|
spa_close(spa, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
fuid_table_destroy();
|
|
|
|
|
2017-01-28 23:16:43 +03:00
|
|
|
dump_debug_buffer();
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
libzfs_fini(g_zfs);
|
|
|
|
kernel_fini();
|
|
|
|
|
OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this has happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there are few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and try to figure out the path of the object by following its parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, its parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken its place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change its parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2017-07-06 20:35:20 +03:00
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
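The object-number loop above passes NULL as the end pointer to strtoull() and only reports an error when the result is 0 and errno is set, so an argument with trailing characters such as "12abc" is accepted as 12. A stricter parse could look like the sketch below. parse_object_id() is a hypothetical helper, not a zdb function, and this is only one possible way to tighten the check.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of zdb): parse one object number.
 * Rejects input with no digits, input with trailing characters, and
 * values that overflow unsigned long long.
 * Returns 0 on success or an errno value on failure.
 */
static int
parse_object_id(const char *arg, uint64_t *out)
{
	char *end = NULL;
	unsigned long long val;

	errno = 0;
	val = strtoull(arg, &end, 0);	/* base 0: decimal, octal or hex */
	if (errno != 0)
		return (errno);		/* e.g. ERANGE on overflow */
	if (end == arg || *end != '\0')
		return (EINVAL);	/* no digits, or trailing junk */

	*out = (uint64_t)val;
	return (0);
}
```

In the loop, a non-zero return value could be handed to strerror() for the same "bad number" message the existing code prints.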
|
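The flagbits[] assignments above build a lookup table indexed by a flag character, so a string of flag letters can be folded into a bitmask one character at a time. The sketch below illustrates that table-lookup idea in isolation. The FLAG_* names and their numeric values are placeholders chosen for this example (the real ZDB_FLAG_* values are not shown in this section), and flags_from_string() is a hypothetical function.

```c
#include <stdint.h>

/* Placeholder bit values, assumed for illustration only. */
enum {
	FLAG_CHECKSUM		= 1 << 0,
	FLAG_DECOMPRESS		= 1 << 1,
	FLAG_BSWAP		= 1 << 2,
	FLAG_GBH		= 1 << 3,
	FLAG_INDIRECT		= 1 << 4,
	FLAG_PHYS		= 1 << 5,
	FLAG_RAW		= 1 << 6,
	FLAG_PRINT_BLKPTR	= 1 << 7
};

/*
 * Hypothetical helper: OR together the bits for every letter in s.
 * Returns 0 on success, -1 if s contains a letter with no flag.
 */
static int
flags_from_string(const char *s, uint32_t *out)
{
	/* Same letter-to-flag mapping as the flagbits[] assignments. */
	static const uint32_t bits[256] = {
		['b'] = FLAG_PRINT_BLKPTR,
		['c'] = FLAG_CHECKSUM,
		['d'] = FLAG_DECOMPRESS,
		['e'] = FLAG_BSWAP,
		['g'] = FLAG_GBH,
		['i'] = FLAG_INDIRECT,
		['p'] = FLAG_PHYS,
		['r'] = FLAG_RAW
	};
	uint32_t flags = 0;

	for (; *s != '\0'; s++) {
		uint32_t bit = bits[(unsigned char)*s];

		if (bit == 0)
			return (-1);	/* unknown flag letter */
		flags |= bit;
	}
	*out = flags;
	return (0);
}
```

Indexing through an unsigned char keeps the lookup within the 256-entry table even when char is signed on the target platform.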